CN103020083B - The automatic mining method of demand recognition template, demand recognition methods and corresponding device - Google Patents
- Publication number
- CN103020083B (application number CN201110286986.XA)
- Authority
- CN
- China
- Prior art keywords
- query
- template
- preset
- requirement
- type
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides an automatic mining method for demand recognition templates, a demand recognition method and corresponding devices. A query set corresponding to clicks on web pages of a preset type is determined from a search log. From the query set, queries are selected for which the total number of clicks on web pages of the preset type exceeds a preset count threshold and/or the click ratio of web pages of the preset type exceeds a preset click ratio threshold, and the selected queries are used as seed queries of the preset type. Each seed query is matched against the dictionary of the preset type, and the words in the seed query that match the dictionary are replaced with the attribute marks of the corresponding dictionary words, yielding a template set of the preset type. The demand recognition templates of the preset type are determined from this template set. The invention saves manpower, widens the range of queries that demand recognition can cover, and improves the recall rate.
Description
[ technical field ]
The invention relates to the technical field of computers, in particular to an automatic mining method of a demand identification template, a demand identification method and a corresponding device.
[ background of the invention ]
With the rapid development and maturity of the internet in the global scope, the information resources on the network are continuously abundant, the information data volume is rapidly expanding, and the acquisition of information through a search engine has become the main way for modern people to acquire information. To provide users with more convenient and accurate query services is the development direction of search engine technology in the present and future.
In search engine technology, identifying the search requirement of a user is an important link in improving the accuracy and effectiveness of search, and it plays a particularly significant role in structured search (i.e. vertical search). For example, when a user inputs the query "how to take a bus from Baidu Building to Wudaokou", the user expects a map result showing the bus route from the starting point to the end point. The search engine therefore needs to recognize that the query has a map-class search requirement, so that a structured search for the bus route can be performed in the map-class structured database. A commonly used approach to requirement identification is to match the query input by the user against established requirement identification templates and to determine the search requirement from the matched template. In the prior art, requirement identification templates are usually configured manually: commonly used query structures are observed by hand and summarized into templates. For the map class, for example, templates such as "how to go from [ place name ] to [ place name ]" and "[ organization name ] is at what position" are configured manually. Manually configured requirement identification templates, however, have the following defects:
firstly, human resources are consumed, and the efficiency of establishing the demand identification template is low.
Secondly, the recall rate of the queries is low, namely, the number of the queries which can be covered is limited, and the application range is narrow.
[ summary of the invention ]
The invention provides an automatic mining method of a demand identification template, a demand identification method and a corresponding device, which are used for saving human resources and expanding the query range covered by demand identification.
The specific technical scheme is as follows:
A method for automatic mining of a demand recognition template, the method comprising:
S1, determining a query set corresponding to clicks on web pages of a preset type in a search log;
S2, selecting, from the query set, queries for which the total number of clicks on web pages of the preset type exceeds a preset count threshold, and/or queries whose web page click ratio for the preset type exceeds a preset click ratio threshold, and using the selected queries as seed queries of the preset type, wherein the web page click ratio of a query for the preset type is: the ratio of the total number of clicks on web pages of the preset type for the query to the total number of clicks on all web pages for the query;
S3, matching each seed query against the dictionary of the preset type, and replacing the words in the seed query that match the dictionary with the attribute marks of the corresponding words in the dictionary to obtain a template set of the preset type, wherein the dictionary comprises words and the attribute marks of the words;
S4, determining the requirement identification templates of the preset type by using the template set of the preset type.
According to a preferred embodiment of the present invention, the step S1 specifically includes:
determining the types of the web pages in the search log, collecting the web pages of the preset type, and forming the query set from all queries for which the web pages of the preset type were clicked; or,
determining sites of the preset type, and forming the query set from all queries in the search log for which web pages of those sites were clicked.
According to a preferred embodiment of the present invention, the words in the dictionary include: named entities and feature words of the preset type.
According to a preferred embodiment of the invention, the method further comprises:
calculating the accuracy and/or recall rate of each template in the template set;
wherein the accuracy of a template is: the ratio of the sum of the preset-type web page click ratios of the queries covered by the template to the number of queries covered by the template;
the recall rate of a template is: the ratio of the number of queries covered by the template to the number of seed queries of the preset type.
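The three quantities defined above can be written compactly as follows. The symbols are introduced here only for illustration and do not appear in the patent: $c_T(q)$ is the number of clicks on web pages of preset type $T$ for query $q$, $c_{\mathrm{all}}(q)$ is the number of clicks on all web pages for $q$, $Q_t$ is the set of seed queries covered by template $t$, and $S_T$ is the set of seed queries of type $T$.

```latex
\[
  r_T(q) = \frac{c_T(q)}{c_{\mathrm{all}}(q)}, \qquad
  \mathrm{accuracy}(t) = \frac{1}{|Q_t|} \sum_{q \in Q_t} r_T(q), \qquad
  \mathrm{recall}(t) = \frac{|Q_t|}{|S_T|}
\]
```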
According to a preferred embodiment of the present invention, the step S4 specifically includes:
determining each template in the template set as a requirement identification template of the preset type; or,
selecting, from the template set, the templates whose accuracy is higher than a preset accuracy threshold and/or whose recall rate is higher than a preset recall rate threshold as the requirement identification templates of the preset type.
A method of identifying a demand, the method comprising:
A1, matching the query to be identified against the dictionaries of the preset types respectively, and replacing the words in the query to be identified that match a dictionary with the attribute marks of the corresponding words in that dictionary to obtain the semantic annotation of the query to be identified, wherein each dictionary comprises words and the attribute marks of the words;
A2, matching the semantic annotation of the query to be identified against the requirement identification templates of the preset types respectively, and determining the requirement type of the query to be identified by using the type corresponding to the matched requirement identification template;
wherein the requirement identification templates of the preset types are mined automatically by the above automatic mining method for requirement identification templates.
According to a preferred embodiment of the present invention, the words in the dictionary include: named entities and feature words of the corresponding preset type.
According to a preferred embodiment of the present invention, in step A1, if the same word in the query to be identified matches words in multiple dictionaries, the word is replaced with the attribute mark of the matched word from each of the dictionaries respectively, so as to obtain multiple semantic annotations of the query to be identified.
According to a preferred embodiment of the present invention, when the semantic annotation of the query to be identified matches multiple requirement identification templates, the requirement type of the query to be identified is further determined in step A2 by combining the accuracy and/or recall rate of each matched requirement identification template.
According to a preferred embodiment of the present invention, determining the requirement type of the query to be identified in step A2 includes:
determining, among the matched requirement identification templates, the requirement type corresponding to a template whose accuracy and/or recall rate meets a preset requirement as the requirement type of the query to be identified; or,
determining the requirement types corresponding to the top N matched requirement identification templates, ranked by accuracy and/or recall rate, as the requirement types of the query to be identified, wherein N is a preset positive integer; or,
determining the requirement level of the query to be identified for each requirement type according to the requirement level corresponding to the accuracy and/or recall rate of each matched requirement identification template.
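The three alternatives listed above for step A2 can be sketched as follows. This is a minimal illustration under stated assumptions: the matched types, their accuracy values, the function names and the level bands (`BANDS`) are all hypothetical, and only accuracy is used as the ranking signal although the text also allows recall rate.

```python
# Hypothetical matches for one query: (requirement type, accuracy of the matched template).
matched = [("recruitment", 0.48), ("novel", 0.36), ("map", 0.13)]


def types_meeting_threshold(matched, min_accuracy):
    """Alternative 1: keep the types whose template accuracy meets a preset requirement."""
    return [req_type for req_type, acc in matched if acc >= min_accuracy]


def top_n_types(matched, n):
    """Alternative 2: keep the types of the top N templates ranked by accuracy."""
    ranked = sorted(matched, key=lambda item: item[1], reverse=True)
    return [req_type for req_type, _ in ranked[:n]]


def demand_levels(matched, bands):
    """Alternative 3: map each accuracy to a requirement level.

    bands: (lower bound, level) pairs sorted from the highest bound to the lowest;
    the last bound must be 0.0 so that every accuracy falls into some band.
    """
    return {
        req_type: next(level for bound, level in bands if acc >= bound)
        for req_type, acc in matched
    }


# Hypothetical level bands, not taken from the patent.
BANDS = [(0.40, "strong"), (0.20, "medium"), (0.0, "weak")]

above_threshold = types_meeting_threshold(matched, 0.30)
top_two = top_n_types(matched, 2)
levels = demand_levels(matched, BANDS)
```

The third alternative does not discard any type; it grades all matched types, which suits uses where several requirement types are shown with different prominence.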
An automatic mining apparatus for a demand recognition template, the apparatus comprising:
a first selecting unit, configured to determine a query set corresponding to clicks on web pages of a preset type in a search log;
a second selecting unit, configured to select, from the query set, queries for which the total number of clicks on web pages of the preset type exceeds a preset count threshold, and/or queries whose web page click ratio for the preset type exceeds a preset click ratio threshold, and to use the selected queries as seed queries of the preset type, wherein the web page click ratio of a query for the preset type is: the ratio of the total number of clicks on web pages of the preset type for the query to the total number of clicks on all web pages for the query;
a mark replacing unit, configured to match each seed query against the dictionary of the preset type and to replace the words in the seed query that match the dictionary with the attribute marks of the corresponding words in the dictionary to obtain the template set of the preset type, wherein the dictionary comprises words and the attribute marks of the words;
and a template determining unit, configured to determine the requirement identification templates of the preset type by using the template set of the preset type.
According to a preferred embodiment of the present invention, the first selecting unit specifically determines the types of the web pages in the search log, collects the web pages of the preset type, and forms the query set from all queries for which the web pages of the preset type were clicked; or,
determines sites of the preset type, and forms the query set from all queries in the search log for which web pages of those sites were clicked.
According to a preferred embodiment of the present invention, the words in the dictionary include: named entities and feature words of the preset type.
According to a preferred embodiment of the present invention, the apparatus further comprises: a weight calculation unit, configured to calculate the accuracy and/or recall rate of each template in the template set;
wherein the accuracy of a template is: the ratio of the sum of the preset-type web page click ratios of the queries covered by the template to the number of queries covered by the template;
the recall rate of a template is: the ratio of the number of queries covered by the template to the number of seed queries of the preset type.
According to a preferred embodiment of the present invention, the template determining unit determines each template in the template set as a requirement identification template of the preset type; or,
selects, from the template set, the templates whose accuracy is higher than a preset accuracy threshold and/or whose recall rate is higher than a preset recall rate threshold as the requirement identification templates of the preset type.
A demand recognition apparatus, comprising:
a semantic annotation unit, configured to match the query to be identified against the dictionaries of the preset types respectively, and to replace the words in the query to be identified that match a dictionary with the attribute marks of the corresponding words in that dictionary to obtain the semantic annotation of the query to be identified, wherein each dictionary comprises words and the attribute marks of the words;
a requirement determining unit, configured to match the semantic annotation of the query to be identified against the requirement identification templates of the preset types respectively, and to determine the requirement type of the query to be identified by using the type corresponding to the matched requirement identification template;
wherein the requirement identification templates of the preset types are mined automatically by the above automatic mining apparatus for requirement identification templates.
According to a preferred embodiment of the present invention, the words in the dictionary include: named entities and feature words of the corresponding preset type.
According to a preferred embodiment of the present invention, when the same word in the query to be identified matches words in multiple dictionaries, the semantic annotation unit replaces the word with the attribute mark of the matched word from each of the dictionaries respectively, so as to obtain multiple semantic annotations of the query to be identified.
According to a preferred embodiment of the present invention, when the semantic annotation of the query to be identified matches multiple requirement identification templates, the requirement determining unit determines the requirement type of the query to be identified by combining the accuracy and/or recall rate of each matched requirement identification template.
According to a preferred embodiment of the present invention, when determining the requirement type of the query to be identified, the requirement determining unit determines, among the matched requirement identification templates, the requirement type corresponding to a template whose accuracy and/or recall rate meets a preset requirement as the requirement type of the query to be identified; or,
determines the requirement types corresponding to the top N matched requirement identification templates, ranked by accuracy and/or recall rate, as the requirement types of the query to be identified, wherein N is a preset positive integer; or,
determines the requirement level of the query to be identified for each requirement type according to the requirement level corresponding to the accuracy and/or recall rate of each matched requirement identification template.
As can be seen from the above technical scheme, the invention collects the query set corresponding to clicks on web pages of a preset type in the search log, obtains seed queries of the preset type from it, matches the seed queries against the dictionary of the preset type and replaces the matched words with attribute marks, and thereby determines the requirement identification templates of the preset type, so that the templates are mined automatically. No manual participation is needed, which greatly saves human resources. Because the templates are mined from the search log, their wording better matches the search habits of users, a large number of queries can be covered, and the recall rate is improved.
[ description of the drawings ]
Fig. 1 is a flowchart of a mining method for a requirement identification template according to a first embodiment of the present invention;
Fig. 2 is a flowchart of a demand identification method according to a second embodiment of the present invention;
Fig. 3 is a structural diagram of an automatic mining apparatus for a requirement identification template according to a third embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a demand identification apparatus according to a fourth embodiment of the present invention;
Fig. 5 is a diagram illustrating an example of requirement identification for vertical search according to an embodiment of the present invention;
Fig. 6 is a diagram illustrating an example of requirement identification for information recommendation according to an embodiment of the present invention.
[ detailed description ]
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Mining a requirement identification template requires a large number of queries that share the same type of requirement. These queries are used to determine the requirement identification template for that type of requirement and are referred to as seed queries. Analysis of user search behavior shows that, after a user inputs a query and searches, the web pages the user clicks in the search results usually reflect the user's search requirement. For example, after inputting the query "Shanghai Volkswagen recruitment", users click web pages of recruitment websites in the search results. Seed queries can therefore be mined from the web pages that users click. The mining method for a requirement identification template provided by the present invention is described below with reference to the first embodiment.
The first embodiment
Fig. 1 is a flowchart of a mining method for a requirement identification template according to the first embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step 101: Determine the query set corresponding to clicks on web pages of a preset type in the search log.
This step can be carried out in either of the following two modes:
Mode 1: After the types of the web pages are determined by an existing web page type identification method, the web pages of the preset type are collected, and all queries in the search log for which these web pages were clicked form the query set.
The type of each web page in the search log can be determined by a web page classification method based on text features, or by calculating the similarity between the text feature vector of the web page and the feature vector of the preset type, and so on; the web pages of the preset type are then collected.
Mode 2: Sites of the preset type are determined, the web pages of these sites are regarded as web pages of the preset type, and all queries in the search log for which web pages of these sites were clicked form the query set.
For example, for the recruitment class, sites of the recruitment type are determined, and all queries in the search log for which web pages of these sites were clicked form the query set. The queries in the query set serve as candidate seed queries from which recruitment requirement templates are subsequently extracted.
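The site-based mode of Step 101 can be sketched as follows. This is a minimal illustration, assuming the search log is available as (query, clicked URL) pairs; the log format, the example host names and the function name `build_query_set` are hypothetical and not specified by the patent.

```python
from urllib.parse import urlparse

# Hypothetical search log: one (query, clicked_url) pair per click.
search_log = [
    ("acme recruitment", "http://jobs.example.com/acme"),
    ("acme recruitment", "http://news.example.org/acme"),
    ("weather today", "http://weather.example.net/"),
]

# Hypothetical sites regarded as belonging to the preset (recruitment) type.
recruitment_sites = {"jobs.example.com"}


def build_query_set(log, sites):
    """Collect every query that led to at least one click on a page of a preset-type site."""
    return {query for query, url in log if urlparse(url).hostname in sites}


query_set = build_query_set(search_log, recruitment_sites)
```

A query enters the set as soon as one of its clicks lands on a preset-type site; how strongly it belongs to the type is decided later, when seed queries are selected.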
Step 102: Select, from the query set, the queries for which the total number of clicks on web pages of the preset type exceeds a preset count threshold and/or the web page click ratio for the preset type exceeds a preset click ratio threshold, and use them as seed queries. The web page click ratio of a query for the preset type is: the ratio of the total number of clicks on web pages of the preset type for the query to the total number of clicks on all web pages for the query.
For example, if web pages of the recruitment class were clicked 180 times in total for a query, and all web pages were clicked 500 times in total for that query, the recruitment-class web page click ratio of the query is 180/500 = 0.36. The web page click ratio reflects the likelihood that the query has a recruitment-class requirement.
The total number of clicks on web pages of the preset type for a query also reflects this likelihood. Seed queries can therefore be selected in any of three ways: the queries in the query set whose total number of clicks on preset-type web pages exceeds the preset count threshold; the queries whose preset-type web page click ratio exceeds the preset click ratio threshold; or the queries that satisfy both conditions.
Taking the recruitment class as an example, assume that the determined query set is as shown in Table 1 and contains 42 queries. If the queries for which the total number of clicks on preset-type web pages exceeds 1 and the preset-type web page click ratio exceeds 0.05 are selected as seed queries, 40 seed queries are obtained.
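The selection rule of Step 102 can be sketched as follows. This is a minimal illustration: the `ClickStats` record, the function names and the sample queries are hypothetical, and only the 180/500 example and the thresholds 1 and 0.05 come from the text. Passing a single threshold applies that condition alone; passing both requires both to hold.

```python
from collections import namedtuple

# Clicks on preset-type pages and clicks on all pages for one query (hypothetical record).
ClickStats = namedtuple("ClickStats", ["type_clicks", "all_clicks"])


def click_ratio(stats):
    """Share of a query's clicks that landed on web pages of the preset type."""
    return stats.type_clicks / stats.all_clicks if stats.all_clicks else 0.0


def select_seed_queries(stats_by_query, min_clicks=None, min_ratio=None):
    """Keep the queries that exceed every threshold that is supplied."""
    seeds = []
    for query, stats in stats_by_query.items():
        if min_clicks is not None and stats.type_clicks <= min_clicks:
            continue
        if min_ratio is not None and click_ratio(stats) <= min_ratio:
            continue
        seeds.append(query)
    return seeds


stats_by_query = {
    "query a": ClickStats(180, 500),  # click ratio 0.36, as in the example above
    "query b": ClickStats(1, 400),    # too few clicks on preset-type pages
    "query c": ClickStats(20, 1000),  # click ratio 0.02, below the ratio threshold
}

seeds = select_seed_queries(stats_by_query, min_clicks=1, min_ratio=0.05)
```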
TABLE 1
Step 103: Match each seed query against the dictionary of the preset type, and replace the words in the seed query that match the dictionary with the attribute marks of the corresponding dictionary words, to obtain the template set of the preset type. The dictionary comprises words and the attribute marks of the words.
In the embodiment of the invention, the dictionary of the preset type is obtained in advance, either manually or by machine mining. The dictionary comprises words and the attribute marks of the words, and the words in the dictionary comprise named entities and feature words of the preset type. Named entities may include, but are not limited to: person names, place names, dish names, organization names, job titles and so on. The dictionary of the preset type may be obtained in an existing manner, which is not described in detail here.
For example, words such as actuary and senior engineer have the attribute job title and can be given the attribute mark 【POS】; words such as Beijing and Shanghai have the attribute place name and can be given the attribute mark 【LOC】; words such as Agricultural Development Bank and FAW Group have the attribute organization name and can be given the attribute mark 【ORG】; words such as recruitment, recruitment concurrent employment and campus recruitment are feature words of the recruitment class and can be given the attribute mark 【JOB】.
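The attribute mark replacement of Step 103 can be sketched as follows. This is a minimal illustration under stated assumptions: the dictionary entries are English stand-ins, ASCII brackets replace the 【 】 marks, and the longest-match-first scan is one simple matching strategy chosen here, not a strategy prescribed by the patent.

```python
# Hypothetical recruitment dictionary: word -> attribute mark.
recruitment_dictionary = {
    "Agricultural Development Bank": "[ORG]",
    "Shanghai": "[LOC]",
    "senior engineer": "[POS]",
    "campus recruitment": "[JOB]",
    "recruitment": "[JOB]",
}


def to_template(query, dictionary):
    """Scan the query left to right, replacing dictionary words with their marks.

    Longer words are tried first so that "campus recruitment" wins over "recruitment".
    """
    words = sorted(dictionary, key=len, reverse=True)
    parts = []
    i = 0
    while i < len(query):
        for word in words:
            if query.startswith(word, i):
                parts.append(dictionary[word])
                i += len(word)
                break
        else:
            # No dictionary word starts here: keep the character unchanged.
            parts.append(query[i])
            i += 1
    return "".join(parts)


seed_queries = [
    "Agricultural Development Bank 2012 campus recruitment",
    "Shanghai recruitment senior engineer",
]
template_set = [to_template(q, recruitment_dictionary) for q in seed_queries]
```

Text that matches no dictionary word, such as the year 2012, stays in the template verbatim, which is how templates with literal parts arise.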
The seed queries obtained from Table 1 are matched against the recruitment dictionary, and the matched words are replaced with the attribute marks of the corresponding dictionary words; the resulting template set is shown in Table 2.
TABLE 2
Template set | Recruitment web page click ratio |
【ORG】【JOB】 | 0.3361 |
【ORG】【JOB】 | 0.4779 |
【ORG】【JOB】 | 0.1302 |
【ORG】【JOB】 | 0.3005 |
【ORG】2012【JOB】 | 0.5585 |
【ORG】【JOB】 | 0.1373 |
【ORG】2012【JOB】 | 0.5090 |
【ORG】 | 0.3827 |
【ORG】【JOB】 | 0.8175 |
【ORG】【JOB】【POS】 | 0.5822 |
【ORG】【JOB】【POS】 | 0.0996 |
【ORG】【JOB】 | 0.3258 |
【ORG】【JOB】 | 0.2114 |
【ORG】【JOB】【POS】 | 0.2890 |
【ORG】【JOB】【POS】 | 0.1399 |
【ORG】【JOB】 | 0.5478 |
【ORG】【JOB】 | 0.1703 |
【ORG】【JOB】 | 0.3769 |
【ORG】2012【JOB】 | 0.3851 |
【ORG】【JOB】 | 0.1374 |
【ORG】【JOB】 | 0.1165 |
【ORG】2012【JOB】 | 0.5612 |
【ORG】2012【JOB】 | 0.6778 |
【ORG】【JOB】 | 0.7330 |
【ORG】【JOB】 | 0.0555 |
【ORG】2012【JOB】 | 0.1187 |
【ORG】【JOB】 | 0.4873 |
【ORG】2012【JOB】 | 0.3438 |
【ORG】【JOB】 | 0.6965 |
【ORG】2012【JOB】 | 0.5814 |
【ORG】【JOB】 | 0.2151 |
2012【ORG】【JOB】 | 0.2178 |
【ORG】【JOB】 | 0.7455 |
【ORG】2012【JOB】 | 0.5825 |
【LOC】mid-autumn【JOB】 | 0.4708 |
【ORG】【JOB】 | 0.1453 |
【ORG】【JOB】【POS】 | 0.8000 |
【ORG】【JOB】【POS】 | 0.3636 |
【ORG】【JOB】【POS】 | 0.3333 |
【LOC】【JOB】【POS】 | 0.2500 |
Because different templates cover different numbers of queries, their identification accuracy and recall rate differ. The accuracy and recall rate of each template in the template set may therefore be further calculated.
The accuracy of a template can be: the ratio of the sum of the preset-type web page click ratios of the queries covered by the template to the number of queries covered by the template.
Taking the template "【ORG】2012【JOB】" as an example, the template covers 9 queries in Table 1, so its accuracy is:
(0.558+0.509+0.3851+0.5612+0.6778+0.1187+0.3438+0.5814+0.5825)/9=47.97%
The recall rate of a template is: the ratio of the number of queries covered by the template to the number of seed queries of the preset type.
Still taking the template "【ORG】2012【JOB】" as an example, it covers 9 queries in Table 1, and there are 40 seed queries of the recruitment class, so its recall rate is:
9/40=22.50%.
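The two calculations above can be reproduced with a short sketch. The function names are hypothetical; the click ratios are the nine Table 2 rows for the template 【ORG】2012【JOB】. Note that Table 2 lists 0.5585 and 0.5090 where the worked formula uses 0.558 and 0.509, so the computed accuracy is about 0.48 and can differ from the printed figure in the last digit.

```python
def template_accuracy(click_ratios):
    """Mean preset-type click ratio over the seed queries covered by a template."""
    return sum(click_ratios) / len(click_ratios)


def template_recall(covered_count, seed_count):
    """Share of all seed queries of the preset type that the template covers."""
    return covered_count / seed_count


# Click ratios of the nine Table 2 rows whose template is [ORG]2012[JOB].
ratios_2012 = [0.5585, 0.5090, 0.3851, 0.5612, 0.6778, 0.1187, 0.3438, 0.5814, 0.5825]

accuracy = template_accuracy(ratios_2012)
recall = template_recall(len(ratios_2012), 40)  # 40 recruitment seed queries
```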
Calculated in this manner, the accuracy and recall rate of each template in Table 2 are shown in Table 3. The accuracy and recall rate of a template may be used for further screening of the requirement identification templates (see Step 104), or for choosing among templates when a query matches several requirement identification templates during requirement identification (see the second embodiment).
TABLE 3
Semantic annotation template | Accuracy | Recall rate |
【ORG】【JOB】 | 35.82% | 50.00% |
【ORG】2012【JOB】 | 47.97% | 22.50% |
【ORG】 | 38.27% | 2.50% |
【ORG】【JOB】【POS】 | 13.04% | 17.50% |
2012【ORG】【JOB】 | 21.78% | 2.50% |
【LOC】mid-autumn【JOB】 | 47.08% | 2.50% |
【LOC】【JOB】【POS】 | 25.00% | 2.50% |
In the embodiment of the present invention, the templates in the template set obtained in Step 103 may be used directly as the requirement identification templates of the preset type, or Step 104 may be further performed.
Step 104: Select, from the template set, the templates whose accuracy is higher than a preset accuracy threshold and/or whose recall rate is higher than a preset recall rate threshold as the requirement identification templates of the preset type.
The criterion can be chosen according to the actual situation. If the accuracy of requirement identification matters more, the templates whose accuracy is higher than the preset accuracy threshold are selected from the template set as the requirement identification templates of the preset type. If the recall rate of requirement identification matters more, the templates whose recall rate is higher than the preset recall rate threshold are selected. If accuracy and recall rate are both to be considered, the templates whose accuracy is higher than the preset accuracy threshold and whose recall rate is higher than the preset recall rate threshold are selected.
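The screening of Step 104 can be sketched as follows. The accuracy and recall values are copied from Table 3, with ASCII brackets in place of the 【 】 marks; the function name and all threshold values are hypothetical choices for illustration only.

```python
# (accuracy, recall rate) per template, copied from Table 3.
TEMPLATE_STATS = {
    "[ORG][JOB]": (0.3582, 0.5000),
    "[ORG]2012[JOB]": (0.4797, 0.2250),
    "[ORG]": (0.3827, 0.0250),
    "[ORG][JOB][POS]": (0.1304, 0.1750),
    "2012[ORG][JOB]": (0.2178, 0.0250),
    "[LOC]mid-autumn[JOB]": (0.4708, 0.0250),
    "[LOC][JOB][POS]": (0.2500, 0.0250),
}


def select_templates(stats, min_accuracy=None, min_recall=None):
    """Keep the templates that exceed every threshold that is supplied."""
    selected = []
    for template, (accuracy, recall) in stats.items():
        if min_accuracy is not None and accuracy <= min_accuracy:
            continue
        if min_recall is not None and recall <= min_recall:
            continue
        selected.append(template)
    return selected


# Hypothetical thresholds for the three tendencies described above.
accuracy_first = select_templates(TEMPLATE_STATS, min_accuracy=0.45)
recall_first = select_templates(TEMPLATE_STATS, min_recall=0.10)
balanced = select_templates(TEMPLATE_STATS, min_accuracy=0.30, min_recall=0.10)
```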
After the requirement identification templates of each preset type are obtained according to the first embodiment, they can be used to identify the requirement of a query. The requirement identification process is described below with reference to the second embodiment.
The second embodiment
Fig. 2 is a flowchart of a demand identification method according to the second embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
step 201: and matching the query to be recognized with dictionaries of various preset types respectively, and replacing words matched with the dictionaries in the query to be recognized with attribute marks of corresponding words in the dictionaries to obtain semantic labels of the query to be recognized.
The query to be identified can be a query input by a user on a search interface provided by a search engine.
In this step, the attribute label replacement of the query to be recognized is similar to the attribute label replacement of the seed query in step 103 of the first embodiment, except that the query to be recognized needs to be matched with dictionaries of preset types respectively.
For example: for the query to be identified "Chinese food marketing limited company recruiting field residents", attribute label replacement yields the semantic annotation "[ORG] [JOB] [POS]", wherein [ORG] is the label of an organization name, [JOB] is the label of a recruitment feature word, and [POS] is the label of a position name.
Because the same word in the query to be identified may match words in several dictionaries at the same time, the attribute label of the matched word in each of these dictionaries is used for replacement respectively, so that multiple semantic annotations of the query to be identified can be obtained.
For example: for the query to be identified "software empire latest chapter", matching with the novel dictionary and then performing attribute label replacement yields the semantic annotation "[BOK] latest [NOV]", wherein [BOK] is the label of a book name and [NOV] is the label of a novel feature word. Matching with the game dictionary and then performing attribute label replacement yields the semantic annotation "[SOF] empire latest chapter", wherein [SOF] is the label of a software feature word.
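The attribute label replacement of step 201 can be sketched as below. This is an illustrative sketch under stated assumptions: the function names `annotate` and `annotate_all`, the leftmost-longest matching strategy, the dictionary type names, and the English stand-in strings are all hypothetical; the embodiment does not specify how matching is implemented.

```python
from typing import Dict


def annotate(query: str, dictionary: Dict[str, str]) -> str:
    """Replace dictionary words found in the query with their attribute labels.

    Scans left to right and, at each position, prefers the longest
    dictionary word (an assumed strategy, not taken from the source).
    """
    words = sorted((w for w in dictionary if w), key=len, reverse=True)
    out = []
    i = 0
    while i < len(query):
        for w in words:
            if query.startswith(w, i):
                out.append(dictionary[w])
                i += len(w)
                break
        else:
            # No dictionary word starts here; keep the character unchanged.
            out.append(query[i])
            i += 1
    return "".join(out)


def annotate_all(query: str, dictionaries: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Produce one semantic annotation per preset-type dictionary."""
    return {ptype: annotate(query, d) for ptype, d in dictionaries.items()}


# Hypothetical dictionaries: word -> attribute label.
dictionaries = {
    "novel": {"software empire": "[BOK]", "chapter": "[NOV]"},
    "software": {"software": "[SOF]"},
}
annotations = annotate_all("software empire latest chapter", dictionaries)
```

Because each dictionary is applied separately, a word that appears in several dictionaries gives rise to several semantic annotations of the same query, one per preset type.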
Step 202: Match the semantic annotation of the query to be identified against the requirement identification templates of the preset types respectively, and determine the requirement type of the query to be identified by using the type corresponding to the matched requirement identification template.
The matching in this step judges whether there is a requirement identification template that is consistent with the semantic annotation of the query to be identified. The preset type corresponding to the matched requirement identification template can be directly determined as the requirement type of the query to be identified.
The requirement type of the query to be identified can also be determined by combining the accuracy and/or recall rate of each requirement identification template, and the following three ways can be specifically adopted:
The first way: among the matched requirement identification templates, the requirement type corresponding to a template whose accuracy and/or recall rate meets a preset requirement is determined as the requirement type of the query to be identified.
The second way: among the matched requirement identification templates, the requirement types corresponding to the top N templates ranked by accuracy and/or recall rate are determined as the requirement types of the query to be identified, wherein N is a preset positive integer.
The third way: the requirement level of the query to be identified on each requirement type is determined according to the requirement level corresponding to the accuracy and/or recall rate of each matched requirement identification template.
That is, requirement levels such as strong requirement, general requirement, weak requirement and no requirement can be defined in advance according to the accuracy and/or recall rate of the requirement identification templates. Whether the query to be identified has a strong requirement, a general requirement, a weak requirement or no requirement on each requirement type is then determined according to the accuracy and/or recall rate of each matched requirement identification template.
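The three ways above can be sketched as follows, using accuracy alone for brevity (recall rate could be handled in the same manner). The names `MatchedTemplate`, `types_meeting_threshold`, `types_of_top_n` and `requirement_levels`, as well as the level cut-offs 0.8 and 0.5, are assumptions made for illustration; the embodiment does not give concrete values.

```python
from typing import Dict, List, NamedTuple


class MatchedTemplate(NamedTuple):
    requirement_type: str  # preset type the matched template belongs to
    accuracy: float        # accuracy computed when the template was mined


def types_meeting_threshold(matches: List[MatchedTemplate], min_accuracy: float) -> List[str]:
    """First way: keep types whose matched template meets a preset requirement."""
    return [m.requirement_type for m in matches if m.accuracy >= min_accuracy]


def types_of_top_n(matches: List[MatchedTemplate], n: int) -> List[str]:
    """Second way: keep the types of the N matched templates with the highest accuracy."""
    ranked = sorted(matches, key=lambda m: m.accuracy, reverse=True)
    return [m.requirement_type for m in ranked[:n]]


def requirement_levels(matches: List[MatchedTemplate]) -> Dict[str, str]:
    """Third way: map each matched type to a requirement level.

    Types that are absent from the result have no matched template,
    which corresponds to "no requirement".
    """
    def level(accuracy: float) -> str:
        # Hypothetical cut-offs, not taken from the source.
        if accuracy >= 0.8:
            return "strong"
        if accuracy >= 0.5:
            return "general"
        return "weak"

    best: Dict[str, float] = {}
    for m in matches:
        best[m.requirement_type] = max(m.accuracy, best.get(m.requirement_type, 0.0))
    return {rtype: level(acc) for rtype, acc in best.items()}


# Hypothetical templates matched by the annotations of one query.
matches = [MatchedTemplate("novel", 0.85), MatchedTemplate("software", 0.30)]
```

With these sample values, the first two ways both resolve the query to a single type, while the third way keeps both types and grades them.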
The above is a detailed description of the method provided by the present invention, and the following is a detailed description of the apparatus provided by the present invention through the third and fourth embodiments, respectively.
Example III,
Fig. 3 is a structural diagram of an automatic mining apparatus for a requirement identification template according to a third embodiment of the present invention. As shown in fig. 3, the apparatus may include: a first selection unit 301, a second selecting unit 302, a label replacing unit 303 and a template determination unit 304.
The first selection unit 301 determines a query set corresponding to a clicked webpage of a preset type in a search log.
The query set can be determined specifically by the following two ways:
The first way: the first selection unit 301 determines the types of the web pages in the search log, collects the web pages of the preset type, and forms the query set from all queries for which the web pages of the preset type were clicked. When determining the type of a web page in the search log, existing methods may be adopted, such as a web page classification method based on text features, or a method that calculates the similarity between the text feature vector of the web page and the feature vector of the preset type.
The second way: the first selection unit 301 determines the sites of the preset type, and forms the query set from all queries in the search log for which web pages of the sites of the preset type were clicked.
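The second way can be sketched as below. The log format (pairs of query and clicked URL), the function name `query_set_for_sites`, and the example host names are assumptions for illustration; the embodiment does not describe the structure of the search log.

```python
from typing import Iterable, Set, Tuple
from urllib.parse import urlparse


def query_set_for_sites(log: Iterable[Tuple[str, str]], sites: Set[str]) -> Set[str]:
    """Collect the queries whose clicked URL belongs to a site of the preset type."""
    result = set()
    for query, clicked_url in log:
        host = urlparse(clicked_url).hostname or ""
        # Accept the site itself and any of its subdomains.
        if any(host == s or host.endswith("." + s) for s in sites):
            result.add(query)
    return result


# Hypothetical search log entries: (query, clicked URL).
search_log = [
    ("engineer recruitment", "https://jobs.example.com/post/1"),
    ("engineer recruitment", "https://news.example.org/article"),
    ("weather today", "https://news.example.org/weather"),
]
recruitment_queries = query_set_for_sites(search_log, {"jobs.example.com"})
```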
The second selecting unit 302 selects, from the query set, queries for which the total number of clicks on web pages of the preset type exceeds a preset count threshold, and/or queries whose click ratio for web pages of the preset type exceeds a preset click ratio threshold, and takes the selected queries as seed queries of the preset type. The click ratio of a query for web pages of the preset type is: the ratio of the total number of clicks on web pages of the preset type for the query to the total number of clicks on all web pages for the query.
The label replacing unit 303 matches each seed query against the dictionary of the preset type, and replaces the words in the seed query that match the dictionary with the attribute labels of the corresponding words in the dictionary, to obtain a template set of the preset type.
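A minimal sketch of this seed query selection is given below. It takes the click ratio of a query as its clicks on web pages of the preset type divided by its clicks on all web pages, as defined in claim 1. The click record format, the function name `select_seed_queries`, and the threshold values are assumptions; when both thresholds are given, this sketch requires both to be exceeded, which is only one reading of "and/or".

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def select_seed_queries(
    clicks: Iterable[Tuple[str, str]],
    preset_type: str,
    min_clicks: Optional[int] = None,
    min_ratio: Optional[float] = None,
) -> Dict[str, float]:
    """Return seed queries of the preset type mapped to their click ratio.

    Each click record is (query, type of the clicked web page).
    """
    typed = Counter()  # clicks on web pages of the preset type, per query
    total = Counter()  # clicks on all web pages, per query
    for query, page_type in clicks:
        total[query] += 1
        if page_type == preset_type:
            typed[query] += 1

    seeds = {}
    for query, n in typed.items():
        ratio = n / total[query]
        if min_clicks is not None and n <= min_clicks:
            continue
        if min_ratio is not None and ratio <= min_ratio:
            continue
        seeds[query] = ratio
    return seeds


# Hypothetical click records.
clicks = (
    [("engineer recruitment", "recruitment")] * 8
    + [("engineer recruitment", "news")] * 2
    + [("engineer salary", "recruitment")] * 1
    + [("engineer salary", "forum")] * 9
)
seeds = select_seed_queries(clicks, "recruitment", min_clicks=5, min_ratio=0.5)
```

The returned click ratios can be kept alongside the seed queries, since the accuracy of a template is later computed from them.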
In the embodiment of the invention, a dictionary of a preset type is obtained in advance in a manual mode or a machine mining mode, the dictionary comprises words and attribute marks of the words, and the words in the dictionary comprise named entities and characteristic words of the preset type. Named entities may include, but are not limited to: name of person, place, dish, organization, job, etc. The predetermined type of dictionary may be obtained in an existing manner, and is not described in detail herein.
The template determination unit 304 determines a requirement identification template of a preset type using a set of templates of the preset type.
Because different templates cover different numbers of queries, their identification accuracy and recall rate also differ. The accuracy and recall rate of each template in the template set may therefore be further calculated. In this case, the apparatus may further include a weight calculation unit 305 configured to calculate the accuracy and/or recall rate of each template in the template set. The accuracy of a template is: the ratio of the sum of the click ratios, for web pages of the preset type, of the queries covered by the template to the number of queries covered by the template. The recall rate of a template is: the ratio of the number of queries covered by the template to the number of seed queries of the preset type.
The calculated accuracy and recall rate can be used to choose among templates when a query matches multiple requirement identification templates during requirement identification (see the fourth embodiment). They can also be used for template screening: the template determination unit 304 may select, from the template set, templates whose accuracy is higher than a preset accuracy threshold and/or whose recall rate is higher than a preset recall rate threshold as the requirement identification templates of the preset type. Of course, the template determination unit 304 may also directly determine each template in the template set as a requirement identification template of the preset type.
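The two measures defined above can be computed as in the following sketch. The function name `template_metrics`, the input layout (a mapping from template to the seed queries it covers, and a mapping from seed query to its click ratio for web pages of the preset type), and the sample values are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def template_metrics(
    covered: Dict[str, Set[str]],
    click_ratio: Dict[str, float],
) -> Dict[str, Tuple[float, float]]:
    """Return (accuracy, recall rate) for each template.

    accuracy = sum of click ratios of covered queries / number of covered queries
    recall   = number of covered queries / number of seed queries
    `click_ratio` is assumed to hold one entry per seed query.
    """
    seed_count = len(click_ratio)
    metrics = {}
    for template, queries in covered.items():
        if not queries or seed_count == 0:
            continue  # a template covering no query has no defined accuracy
        accuracy = sum(click_ratio[q] for q in queries) / len(queries)
        recall = len(queries) / seed_count
        metrics[template] = (accuracy, recall)
    return metrics


# Hypothetical seed queries q1..q4 with their click ratios.
click_ratio = {"q1": 1.0, "q2": 0.5, "q3": 0.25, "q4": 1.0}
covered = {"[ORG] [JOB]": {"q1", "q2"}, "[JOB] [POS]": {"q3"}}
metrics = template_metrics(covered, click_ratio)
```

In this sample, the template "[ORG] [JOB]" covers two of the four seed queries, so its recall rate is 0.5, and its accuracy is the mean of the two click ratios.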
Example IV,
Fig. 4 is a schematic structural diagram of a requirement identification apparatus according to a fourth embodiment of the present invention, as shown in fig. 4, the requirement identification apparatus includes: a semantic annotation unit 401 and a requirement determination unit 402.
The semantic annotation unit 401 matches the query to be recognized with each dictionary of a preset type, and replaces the words in the query to be recognized, which are matched with the dictionary, with the attribute labels of the corresponding words in the dictionary, so as to obtain the semantic annotation of the query to be recognized, where the dictionary includes the words and the attribute labels of the words.
In the apparatus, the attribute label replacement performed on the query to be identified by the semantic annotation unit 401 is similar to the attribute label replacement performed on the seed queries by the label replacing unit 303 in the third embodiment, except that the semantic annotation unit 401 needs to match the query to be identified against the dictionaries of the preset types respectively.
Likewise, the words in each dictionary include: named entities and feature words of the corresponding preset type.
The requirement determining unit 402 matches the semantic labels of the query to be identified with requirement identification templates of each preset type, and determines the requirement type of the query to be identified by using the type corresponding to the matched requirement identification template.
Wherein, the requirement identification templates of each preset type are automatically mined by the apparatus described in the third embodiment.
When the semantic annotation unit 401 matches the query to be identified against the dictionaries of the preset types, a word in the query may match words in multiple dictionaries. In that case, the word is replaced with the attribute label of the matched word in each of those dictionaries respectively, so that multiple semantic annotations of the query to be identified are obtained.
In addition, when performing template matching, the requirement determining unit 402 may match one semantic label to multiple requirement identification templates, and at this time, may perform requirement identification by using the accuracy and/or recall ratio of each requirement identification template calculated in the third embodiment, that is, when the semantic label of the query to be identified is matched to multiple requirement identification templates, the requirement determining unit 402 determines the requirement type of the query to be identified by combining the accuracy and/or recall ratio of each matched requirement identification template.
Specifically, when determining the requirement type of the query to be identified by combining the accuracy and/or recall of each matched requirement identification template, the requirement determining unit 402 may adopt the following three ways:
The first way: among the matched requirement identification templates, the requirement type corresponding to a template whose accuracy and/or recall rate meets a preset requirement is determined as the requirement type of the query to be identified.
The second way: among the matched requirement identification templates, the requirement types corresponding to the top N templates ranked by accuracy and/or recall rate are determined as the requirement types of the query to be identified, wherein N is a preset positive integer.
The third way: the requirement level of the query to be identified on each requirement type is determined according to the requirement level corresponding to the accuracy and/or recall rate of each matched requirement identification template.
After the requirement type of the query is identified by using the method of the second embodiment or the apparatus of the fourth embodiment, the result can be used in, but is not limited to, the following application scenarios:
1) For ranking in general search. After a user inputs a query, the requirement type of the query is identified, and the pages corresponding to the requirement type of the query are ranked higher in the general search results; or, the pages in the search results are sorted according to the requirement level of their corresponding requirement types, and the like.
For example, if the user inputs the query "engineer recruitment" and the requirement type is identified as the recruitment type, the pages of the recruitment type can be ranked higher in the general search results.
2) For vertical searching. After a user inputs a query, the requirement type of the query is identified, then the query is distributed to the optimal content resource or application provider for processing, and finally, the result matched with the query is accurately and efficiently returned to the user.
For example, if the user inputs the query "how to take a bus from Baidu Building to Wudaokou" and the requirement type is identified as the map type, a vertical search can be performed in the database of the map type and the vertical search result returned, that is, the bus route from Baidu Building to Wudaokou is displayed directly on the map, as shown in fig. 5.
3) For information recommendation. After the user inputs the query, the requirement type of the query is identified, and information recommendation such as advertisement recommendation, recommendation of a knowledge question and answer platform, query recommendation and the like is performed on the user based on the requirement type.
For example, if the user inputs the query "engineer recruitment" and the requirement type is identified as the recruitment type, advertisements related to engineer recruitment can be recommended in the search results, so that the advertisements closely match the actual requirement of the user. As shown in fig. 6, the portion in the dashed box is an advertisement related to engineer recruitment that is recommended to the user.
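As an illustration of application scenario 1), the following sketch moves the pages of the identified requirement type to the front of a result list while keeping the original relative order otherwise. The result format (pairs of URL and page type), the function name `promote_requirement_type`, and the sample URLs are assumptions; the description does not specify how the re-ranking is carried out.

```python
from typing import List, Tuple


def promote_requirement_type(
    results: List[Tuple[str, str]], requirement_type: str
) -> List[Tuple[str, str]]:
    """Rank pages of the identified requirement type ahead of the others.

    Python's sort is stable, so pages keep their original relative
    order within each of the two groups.
    """
    return sorted(results, key=lambda page: page[1] != requirement_type)


# Hypothetical search results: (URL, page type), in original ranking order.
results = [
    ("https://news.example.org/engineers", "news"),
    ("https://jobs.example.com/engineer-1", "recruitment"),
    ("https://forum.example.net/thread", "forum"),
    ("https://jobs.example.com/engineer-2", "recruitment"),
]
reranked = promote_requirement_type(results, "recruitment")
```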
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (20)
1. An automatic mining method for a demand recognition template, the method comprising:
S1, determining a query set corresponding to clicked web pages of a preset type in a search log;
S2, selecting, from the query set, queries for which the total number of clicks on web pages of the preset type exceeds a preset count threshold, and/or queries whose click ratio for web pages of the preset type exceeds a preset click ratio threshold, and taking the selected queries as seed queries of the preset type, wherein the click ratio of a query for web pages of the preset type is: the ratio of the total number of clicks on web pages of the preset type for the query to the total number of clicks on all web pages for the query;
S3, matching each seed query against the dictionary of the preset type, and replacing the words in the seed query that match the dictionary with the attribute labels of the corresponding words in the dictionary, to obtain a template set of the preset type, wherein the dictionary comprises words and the attribute labels of the words;
S4, determining the requirement identification template of the preset type by using the template set of the preset type.
2. The method according to claim 1, wherein the step S1 specifically includes:
determining the types of the webpages in the search logs, collecting the webpages of the preset types, and determining that all corresponding queries form the query set when the webpages of the preset types are clicked; or,
determining sites of a preset type, and forming a query set by all queries corresponding to the clicked web pages of the sites of the preset type in the search log.
3. The method of claim 1, wherein the words in the dictionary comprise: named entities and feature words of the preset type.
4. The method of claim 1, further comprising:
calculating the accuracy and/or recall rate of each template in the template set;
wherein the accuracy of a template is: the ratio of the sum of the click ratios, for web pages of the preset type, of the queries covered by the template to the number of queries covered by the template;
the recall rate of the template is: the ratio of the number of queries covered by the template to the number of seed queries of the preset type.
5. The method according to claim 4, wherein the step S4 specifically includes:
determining each template in the template set as the preset type of requirement identification template; or,
and selecting a template with the accuracy higher than a preset accuracy threshold value and/or the recall rate higher than a preset recall rate threshold value from the template set as the requirement identification template of the preset type.
6. A demand identification method is characterized by comprising the following steps:
A1, matching the query to be identified against dictionaries of preset types respectively, and replacing the words in the query to be identified that match a dictionary with the attribute labels of the corresponding words in the dictionary, to obtain the semantic annotation of the query to be identified, wherein each dictionary comprises words and the attribute labels of the words;
A2, matching the semantic annotation of the query to be identified against the requirement identification templates of the preset types respectively, and determining the requirement type of the query to be identified by using the type corresponding to the matched requirement identification template;
wherein each preset type of demand recognition template is automatically mined by a method as claimed in any one of claims 1 to 5.
7. The demand recognition method according to claim 6, wherein the words in the dictionary include: named entities and feature words of the corresponding preset type.
8. The demand identification method according to claim 6, wherein in the step A1, if a word in the query to be identified matches words in multiple dictionaries, the word is replaced with the attribute label of the matched word in each of the multiple dictionaries respectively, so as to obtain multiple semantic annotations of the query to be identified.
9. The demand identification method according to claim 6 or 8, wherein if each demand identification template of a preset type is mined by the method of claim 4, when the semantic annotation of the query to be identified is matched to a plurality of demand identification templates, the demand type of the query to be identified is further determined in the step A2 according to the accuracy and/or recall rate of each matched demand identification template.
10. The demand identification method according to claim 9, wherein the determining the demand type of the query to be identified in the step A2 comprises:
determining, among the matched requirement identification templates, the requirement type corresponding to a requirement identification template whose accuracy and/or recall rate meets a preset requirement as the requirement type of the query to be identified; or,
determining, among the matched requirement identification templates, the requirement types corresponding to the top N requirement identification templates ranked by accuracy and/or recall rate as the requirement types of the query to be identified, wherein N is a preset positive integer; or,
and determining the requirement level of the query to be identified on each requirement type according to the requirement level corresponding to the accuracy and/or recall rate of each matched requirement identification template.
11. An automatic mining device for a demand recognition template, the device comprising:
the first selection unit is used for determining a query set corresponding to a clicked webpage of a preset type in a search log;
a second selecting unit, configured to select, from the query set, queries for which the total number of clicks on web pages of the preset type exceeds a preset count threshold, and/or queries whose click ratio for web pages of the preset type exceeds a preset click ratio threshold, and to take the selected queries as seed queries of the preset type, wherein the click ratio of a query for web pages of the preset type is: the ratio of the total number of clicks on web pages of the preset type for the query to the total number of clicks on all web pages for the query;
a label replacing unit, configured to match each seed query against the dictionary of the preset type, and to obtain the template set of the preset type after replacing the words in the seed query that match the dictionary with the attribute labels of the corresponding words in the dictionary, wherein the dictionary comprises words and the attribute labels of the words;
and the template determining unit is used for determining the requirement identification template of the preset type by utilizing the template set of the preset type.
12. The apparatus according to claim 11, wherein the first selecting unit specifically determines a type of a web page in a search log, collects the web pages of a preset type, and determines that all corresponding queries when the web pages of the preset type are clicked constitute the query set; or,
determining sites of a preset type, and forming a query set by all queries corresponding to the clicked web pages of the sites of the preset type in the search log.
13. The apparatus of claim 11, wherein the words in the dictionary comprise: named entities and feature words of the preset type.
14. The apparatus of claim 11, further comprising: the weight calculation unit is used for calculating the accuracy and/or recall rate of each template in the template set;
wherein the accuracy of a template is: the ratio of the sum of the click ratios, for web pages of the preset type, of the queries covered by the template to the number of queries covered by the template;
the recall rate of the template is: the ratio of the number of queries covered by the template to the number of seed queries of the preset type.
15. The apparatus according to claim 14, wherein the template determination unit determines each template in the template set as the requirement identification template of the preset type; or,
and selecting a template with the accuracy higher than a preset accuracy threshold value and/or the recall rate higher than a preset recall rate threshold value from the template set as the requirement identification template of the preset type.
16. A demand recognition apparatus, characterized by comprising:
the semantic annotation unit is used for matching the query to be recognized with dictionaries of various preset types respectively, replacing words matched with the dictionaries in the query to be recognized with attribute marks of corresponding words in the dictionaries to obtain semantic annotations of the query to be recognized, wherein the dictionaries comprise the words and the attribute marks of the words;
the requirement determining unit is used for matching the semantic labels of the query to be identified with requirement identification templates of all preset types respectively, and determining the requirement type of the query to be identified by using the type corresponding to the matched requirement identification template;
wherein each preset type of demand recognition template is automatically mined by an apparatus as claimed in any one of claims 11 to 15.
17. The demand recognition apparatus of claim 16, wherein the words in the dictionary comprise: named entities and feature words of the corresponding preset type.
18. The demand recognition device according to claim 16, wherein when a word in the query to be identified matches words in multiple dictionaries, the semantic annotation unit replaces the word with the attribute label of the matched word in each of the multiple dictionaries respectively, to obtain multiple semantic annotations of the query to be identified.
19. The demand identification device of claim 16 or 18, wherein if each demand identification template of a preset type is mined by the device of claim 14, the demand determination unit determines the demand type of the query to be identified in combination with the accuracy and/or recall of each matched demand identification template when the semantic annotation of the query to be identified is matched to a plurality of demand identification templates.
20. The demand identification device of claim 19, wherein when determining the demand type of the query to be identified, the demand determination unit determines, as the demand type of the query to be identified, a demand type corresponding to a demand identification template whose accuracy and/or recall meets preset requirements in each matched demand identification template; or,
determining, among the matched requirement identification templates, the requirement types corresponding to the top N requirement identification templates ranked by accuracy and/or recall rate as the requirement types of the query to be identified, wherein N is a preset positive integer; or,
and determining the requirement level of the query to be identified on each requirement type according to the requirement level corresponding to the accuracy and/or recall rate of each matched requirement identification template.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201110286986.XA CN103020083B (en) | 2011-09-23 | 2011-09-23 | The automatic mining method of demand recognition template, demand recognition methods and corresponding device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103020083A (en) | 2013-04-03 |
CN103020083B (en) | 2016-06-15 |
Family
ID=47968697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201110286986.XA Active CN103020083B (en) | 2011-09-23 | 2011-09-23 | The automatic mining method of demand recognition template, demand recognition methods and corresponding device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103020083B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105243052A (en) * | 2015-09-15 | 2016-01-13 | 浪潮软件集团有限公司 | Corpus labeling method, device and system |
CN106682192B (en) * | 2016-12-29 | 2020-07-03 | 北京奇虎科技有限公司 | Method and device for training answer intention classification model based on search keywords |
CN107832285B (en) * | 2017-08-09 | 2021-02-23 | 联动优势科技有限公司 | Dictionary creating method and equipment |
CN107526812A (en) * | 2017-08-24 | 2017-12-29 | 北京奇艺世纪科技有限公司 | A kind of searching method, device and electronic equipment |
CN107832468B (en) * | 2017-11-29 | 2019-05-10 | 百度在线网络技术(北京)有限公司 | Demand recognition methods and device |
CN112164400A (en) * | 2020-09-18 | 2021-01-01 | 广州小鹏汽车科技有限公司 | Voice interaction method, server and computer-readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000626A (en) * | 2007-01-12 | 2007-07-18 | 宋晓伟 | Information storing method and method for converting search inquiry into inquiry statement |
CN101055587A (en) * | 2007-05-25 | 2007-10-17 | 清华大学 | Search engine retrieving result reordering method based on user behavior information |
CN101178728A (en) * | 2007-11-21 | 2008-05-14 | 北京搜狗科技发展有限公司 | Web side navigation method and system |
CN102073725A (en) * | 2011-01-11 | 2011-05-25 | 百度在线网络技术(北京)有限公司 | Method for searching structured data and search engine system for implementing same |
CN102096716A (en) * | 2011-02-11 | 2011-06-15 | 百度在线网络技术(北京)有限公司 | Search engine-based calculator realizing method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832133B2 (en) * | 2009-08-24 | 2014-09-09 | Microsoft Corporation | Answering web queries using structured data sources |
Non-Patent Citations (1)
Title |
---|
基于查询模板的特定领域中文问答系统的研究与实现;刘亮亮 等;《江苏科技大学学报(自然科学版)》;20110415;第25卷(第2期);163-168 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |