CN106446122B - Information retrieval method and device and computing equipment - Google Patents
Information retrieval method and device and computing equipment Download PDFInfo
- Publication number
- CN106446122B CN106446122B CN201610829891.0A CN201610829891A CN106446122B CN 106446122 B CN106446122 B CN 106446122B CN 201610829891 A CN201610829891 A CN 201610829891A CN 106446122 B CN106446122 B CN 106446122B
- Authority
- CN
- China
- Prior art keywords
- keywords
- keyword
- complete
- index
- document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 63
- 238000012545 processing Methods 0.000 claims description 100
- 230000011218 segmentation Effects 0.000 claims description 40
- PCTMTFRHKVHKIS-BMFZQQSSSA-N (1s,3r,4e,6e,8e,10e,12e,14e,16e,18s,19r,20r,21s,25r,27r,30r,31r,33s,35r,37s,38r)-3-[(2r,3s,4s,5s,6r)-4-amino-3,5-dihydroxy-6-methyloxan-2-yl]oxy-19,25,27,30,31,33,35,37-octahydroxy-18,20,21-trimethyl-23-oxo-22,39-dioxabicyclo[33.3.1]nonatriaconta-4,6,8,10 Chemical compound C1C=C2C[C@@H](OS(O)(=O)=O)CC[C@]2(C)[C@@H]2[C@@H]1[C@@H]1CC[C@H]([C@H](C)CCCC(C)C)[C@@]1(C)CC2.O[C@H]1[C@@H](N)[C@H](O)[C@@H](C)O[C@H]1O[C@H]1/C=C/C=C/C=C/C=C/C=C/C=C/C=C/[C@H](C)[C@@H](O)[C@@H](C)[C@H](C)OC(=O)C[C@H](O)C[C@H](O)CC[C@@H](O)[C@H](O)C[C@H](O)C[C@](O)(C[C@H](O)[C@H]2C(O)=O)O[C@H]2C1 PCTMTFRHKVHKIS-BMFZQQSSSA-N 0.000 description 12
- 241000219095 Vitis Species 0.000 description 12
- 235000009754 Vitis X bourquina Nutrition 0.000 description 12
- 235000012333 Vitis X labruscana Nutrition 0.000 description 12
- 235000014787 Vitis vinifera Nutrition 0.000 description 12
- 238000010586 diagram Methods 0.000 description 10
- 238000004891 communication Methods 0.000 description 7
- 238000001914 filtration Methods 0.000 description 7
- 238000006243 chemical reaction Methods 0.000 description 5
- 230000006870 function Effects 0.000 description 4
- 102100035353 Cyclin-dependent kinase 2-associated protein 1 Human genes 0.000 description 3
- 230000008878 coupling Effects 0.000 description 3
- 238000010168 coupling process Methods 0.000 description 3
- 238000005859 coupling reaction Methods 0.000 description 3
- 238000004590 computer program Methods 0.000 description 2
- 238000005065 mining Methods 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 102100031554 Double C2-like domain-containing protein alpha Human genes 0.000 description 1
- 101000866272 Homo sapiens Double C2-like domain-containing protein alpha Proteins 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000003491 array Methods 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2468—Fuzzy queries
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Software Systems (AREA)
- Databases & Information Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Fuzzy Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Computational Linguistics (AREA)
- Automation & Control Theory (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an information retrieval method, an information retrieval device and computing equipment, wherein the method comprises the following steps: receiving a query statement, wherein the query statement comprises fuzzy keywords, and the fuzzy keywords are words containing matching characters; determining a first part of keywords, wherein the first part of keywords are parts except for the matching characters in the fuzzy keywords; acquiring a first complete keyword according to a first index and a first part of keywords, wherein the first complete keyword comprises the first part of keywords, the first index comprises a key value part and an attribute value part, information stored in the key value part has a corresponding relation with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword; and acquiring a first document according to the second index and the first complete keyword, wherein the first document is the document where the first complete keyword is located, and the second index comprises the corresponding relation between the first complete keyword and the first document. The invention can improve the efficiency of information retrieval in the fuzzy search scene.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for information retrieval, and a computing device.
Background
The information retrieval system is used for retrieving the relevant content required by the user according to the query sentence input by the user. The general search flow of the information search system is as follows: performing word segmentation on query sentences input by a user, then performing stop word filtering processing, performing keyword matching on keywords after the stop word filtering processing and index files to obtain corresponding documents, then scoring the documents, and outputting N documents with the highest scores to return to the user.
The fuzzy search scenario refers to a situation where a user inputs a fuzzy word AB when he does not remember the complete query word ABC, and then wants to find out the document containing ABC.
In order to support the fuzzy search scenario, the existing information retrieval scheme generally traverses the keywords in the index file by the fuzzy words (e.g., AB), to find the keywords matched with the fuzzy words, and then obtains the corresponding documents according to the matched keywords and the index file. Because the keywords of the index file need to be traversed by the fuzzy words, the time consumption of information retrieval is large, and the retrieval efficiency is low.
Disclosure of Invention
The invention provides an information retrieval method, an information retrieval device and computing equipment, which can effectively improve the efficiency of information retrieval in a fuzzy search scene.
In a first aspect, a method for information retrieval is provided, the method comprising: receiving a query statement, wherein the query statement comprises fuzzy keywords which are words containing matching characters; determining a first part of keywords according to the fuzzy keywords, wherein the first part of keywords are the parts of the fuzzy keywords except the matching characters; acquiring a first complete keyword according to a first index and the first part of keywords, wherein the first complete keyword comprises the first part of keywords, the first index comprises a key value part and an attribute value part, information stored in the key value part has a corresponding relation with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword; and acquiring a first document according to a second index and the first complete keyword, wherein the first document is the document where the first complete keyword is located, and the second index comprises the corresponding relation between the first complete keyword and the first document.
In the scheme, the first index comprises the corresponding relation between the partial keywords and the complete keywords, the complete keywords corresponding to the partial keywords can be quickly found through the first index, and compared with the prior art, the method and the device for searching the information in the fuzzy search scene can improve the overall efficiency of information search in the fuzzy search scene.
It should be understood that partial key 1 of the key-value portion of the first index corresponds to full key 1 of the attribute-value portion, and partial key 2 of the key-value portion of the first index corresponds to full key 2 of the attribute-value portion.
It is also understood that each partial keyword in the first index may correspond to one or more full keywords. For convenience of description and understanding, the present invention is described by taking a full keyword as an example, but the present invention is not limited to the scope of the present invention.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the method further includes: acquiring a complete keyword set according to the corpus data set; performing n-tuple segmentation on a second complete keyword in the complete keyword set, determining words after the n-tuple segmentation as second partial keywords, wherein n is a positive integer less than or equal to the character length of the second complete keyword; determining a third complete keyword, wherein the third complete keyword is a word in the complete keyword set, and the word comprises the second part of keywords; and generating the first index according to the second part of keywords and the third complete keywords, wherein the key value part of the first index comprises the second part of keywords, and the attribute value part of the first index comprises the third complete keywords.
According to the scheme, the first index comprising the corresponding relation between the partial keywords and the complete keywords is established, so that the complete keywords corresponding to the partial keywords can be quickly acquired according to the first index during information retrieval, and the overall efficiency of information retrieval in a fuzzy search scene can be improved.
With reference to the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the determining a third complete keyword includes: determining the third complete keyword based on a matching type, wherein the third complete keyword is matched with the second part of keywords based on the matching type, and the matching type is any one of front word matching, rear word matching or front and rear word matching; wherein the generating the first index comprises: generating the first index according to the second part of keywords, the matching type and the third complete keywords, wherein the attribute value part of the first index also comprises the matching type; the obtaining a first complete keyword according to the first index and the first partial keyword includes: and acquiring the first complete keyword according to the first index, the first part of keywords and the matching type of the first part of keywords, wherein the matching type of the first part of keywords is determined according to the position relation between the first part of keywords and the matching character.
With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the method further includes: determining the correlation parameters of the second part of keywords and the third complete keywords according to the occurrence frequency of the second part of keywords in the complete keyword set and the occurrence frequency of the third complete keywords in the complete keyword set; wherein the generating the first index comprises: generating the first index according to the second part of keywords, the matching type, the third complete keywords and the correlation parameters, wherein the attribute value part of the first index further comprises the correlation parameters of the second part of keywords and the third complete keywords; the method further comprises the following steps: acquiring correlation parameters of the first part of keywords and the first complete keywords according to the first index, the first part of keywords and the first complete keywords; and determining the score of the first document according to the correlation parameter of the first part of keywords and the first complete keywords and the correlation parameter of the first document and the query statement.
In the scheme, the scoring of the document where the complete keyword is located is determined based on the correlation parameters between the partial keywords and the complete keyword, so that the correlation of the retrieval result of the information retrieval in the fuzzy search scene can be improved.
With reference to the third possible implementation manner of the first aspect, in a fourth possible implementation manner of the first aspect, the determining the relevance parameter of the second partial keyword and the third complete keyword includes: calculating the second part of keywords w according to the following formulapA correlation parameter r (w) with the third complete keyword wp,w):
r(wp,w)=α·f(wp,w)·s(wp,w)+β·q(wp,w)
Wherein, f (w)pW) represents wpConditional probability with w, s (w)pW) represents wpTightness parameter with w, q (w)pW) represents wpAnd w user feedback weights, α and β being constantAnd (4) counting.
With reference to the third or fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the method further includes: acquiring the number of first documents, wherein the number of the first documents is the number of documents matched with the third complete keyword in the complete keyword set; wherein the generating the first index comprises: generating the first index according to the second part of keywords, the matching type, the third complete keywords, the correlation parameters and the first document number, wherein the attribute value part of the first index also comprises the first document number; the method further comprises the following steps: acquiring the number of second documents corresponding to the first complete keyword according to the first index and the first complete keyword; determining a score of the first document according to the relevance parameter of the first partial keyword and the first complete keyword and the relevance parameter of the first document and the query statement, wherein the determining the score of the first document comprises: and determining the score of the first document according to the number of the second documents, the correlation parameter of the first partial key words and the first complete key words and the correlation parameter of the first document and the query statement.
With reference to the fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, the generating the first index includes: generating the first index according to the following information:
wp=f:w,r(wp,w),n(w)
wherein, wpRepresenting the second partial keyword, w representing the third full keyword, f representing the second partial keyword wpOf the matching type, r (w)pW) represents the second partial keyword wpA relevance parameter with the third full keyword w, n (w) represents the number of documents matched by the third full keyword w in the full keyword set, the left part represents the key value part of the first index, and the right part represents the attribute value part of the first index.
With reference to any one of the first to sixth possible implementation manners of the first aspect, in a seventh possible implementation manner of the first aspect, the second complete keyword is a complete keyword in the complete keyword set, where the length of a character is greater than a threshold L, and n is a positive integer less than or equal to L.
In the scheme, only the complete keywords with the character length exceeding L are subjected to n-tuple segmentation, so that the number of partial keywords in the first index can be effectively controlled, and unnecessary storage overhead is avoided.
With reference to any one possible implementation manner of the first to seventh possible implementation manners of the first aspect, in an eighth possible implementation manner of the first aspect, the obtaining a complete keyword set according to a corpus data set includes: and acquiring the complete keyword set according to the overall coverage rate of the corpus data set and the complete keywords expected to be covered, wherein the number of the complete keywords in the complete keyword set is less than that of the complete keywords in the corpus data set.
According to the scheme, the complete keyword set to be subjected to n-tuple segmentation is determined according to the overall coverage rate of the complete keywords expected to be covered by fuzzy search, the number of partial keywords in the first index can be effectively controlled, and therefore unnecessary storage overhead is avoided.
In a second aspect, an apparatus for information retrieval is provided, where the apparatus includes an input module and a processing module, where the input module is configured to receive a query statement, where the query statement includes a fuzzy keyword, and the fuzzy keyword is a word including a matching character; the processing module is used for determining a first part of keywords according to the fuzzy keywords, wherein the first part of keywords are the parts of the fuzzy keywords except the matching characters; the processing module is further configured to obtain a first complete keyword according to a first index and the first part of keywords, where the first complete keyword includes the first part of keywords, and the first index includes a key value part and an attribute value part, where information stored in the key value part has a corresponding relationship with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword; the processing module is further configured to obtain a first document according to a second index and the first complete keyword, where the first document is a document where the first complete keyword is located, and the second index includes a corresponding relationship between the first complete keyword and the first document. The apparatus is for implementing the first aspect or a method of information retrieval in any possible implementation manner of the first aspect.
In a third aspect, a computing device is provided that includes a processor and a memory. The computing device is operable to implement the method of information retrieval of the first aspect or any of its possible implementations. Program code for implementing the method of the first aspect or any of its possible implementations may be stored in a memory and executed by a processor.
In a fourth aspect, a storage medium is provided, in which program code stored therein is capable of, when executed, implementing the method of information retrieval in the first aspect or any one of its possible implementations. The program code is constituted by computer instructions implementing the method of information retrieval of the first aspect or any one of its possible implementations.
In a fifth aspect, a computer program product is provided, which may be a software installation package, and when the software installation package is executed by a computer, the method for information retrieval in the first aspect or any possible implementation manner of the first aspect is performed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
Fig. 1 is a schematic diagram of an information retrieval system according to an embodiment of the present invention.
Fig. 2 is another schematic diagram of an information retrieval system according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a retrieval device according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of a method for information retrieval according to an embodiment of the present invention.
Fig. 5 is another schematic flow chart of the method for information retrieval according to the embodiment of the present invention.
Fig. 6 is a further schematic flow chart of the method for information retrieval according to the embodiment of the present invention.
Fig. 7 is a schematic block diagram of an apparatus for information retrieval according to an embodiment of the present invention.
FIG. 8 is a schematic block diagram of a computing device of an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The application scenario of the embodiment of the invention is a fuzzy search scenario. In the fuzzy search, the query sentence input by the user includes fuzzy keywords, such as AB.
To facilitate understanding and description of embodiments of the present invention, a description will first be given of several terms to which embodiments of the present invention relate.
1) Fuzzy key words
Fuzzy keywords refer to words that include a match in a query statement. A match refers to a character with no actual semantic meaning, such as an asterisk (#) or question mark (. The match is also called a wildcard.
For example, the query statement is AB, CD, and CD is the fuzzy keyword.
2) Part of the keywords
The partial keyword refers to a portion of the fuzzy keyword other than the match.
For example, in the above example, CD is a partial keyword.
3) Complete key word
A complete keyword is relative to a partial keyword and refers to a semantically complete word that includes the partial keyword.
For example, the complete keyword corresponding to the partial keyword "grape" is "grape", and the complete keyword corresponding to the partial keyword "grape" is "grape", "grape trellis" or "grape skin".
It should be noted that there is a matching type between the partial keyword and the complete keyword, which specifically includes matching of a previous word, matching of a next word, or matching of a previous word and a next word. Wherein, the matching of the previous word means that the complete keyword is completely consistent with the ending character of the partial keyword, and the starting character can be inconsistent; the matching of the later words means that the initial characters of the complete keywords and the partial keywords are completely consistent, and the final characters can be inconsistent; the matching of the words comprises the following three cases: 1) the complete keyword and the ending characters of partial keywords are completely consistent, and the starting characters can be inconsistent; 2) the complete keyword and the initial character of partial keyword are completely consistent, and the final character can be inconsistent; 3) the complete keyword and the partial keyword have completely consistent intermediate characters, the start characters of the complete keyword and the partial keyword are different, and the end characters of the complete keyword and the partial keyword are also different. That is, the same partial keyword corresponds to different complete keywords for different matching types.
Specifically, for example, there are several full semantic words as follows: ABC, BCD, ABCDE. For partial keywords BC, when the matching type is preceding word matching (i.e. the corresponding fuzzy keyword is BC), the corresponding complete keyword is ABC; when the matching type is the post word matching (namely the corresponding fuzzy keyword is BC), the corresponding complete keyword is BCD; when the matching type is front and rear word matching (namely the corresponding fuzzy keyword is BC), the corresponding complete keyword comprises ABCD, ABC and BCD.
Fig. 1 is a schematic diagram of an information retrieval system 100 according to an embodiment of the present invention. Information retrieval system 100 includes a retrieval device 110 and a storage device 120. The storage device 120 stores data required for retrieval by the information retrieval system 100, specifically, data such as an index file, a file library, a historical query statement, a historical query log, and a word feature library shown in fig. 1. Storage device 120 may establish communication with retrieval device 110 through communication network 130; the storage device 120 may also be provided directly in the retrieval device 110. The retrieval device 110 includes an input/output unit and a processing unit, after a user sends a query sentence to the retrieval device 110 through the input/output unit 111, the retrieval device 110 performs information retrieval based on the query sentence through the processing unit, and returns a retrieval result to the user through the input/output unit 111. The search results of an information search system are typically presented to a user through a series of documents. The input output unit may be a network interface if the user sends the query statement to the retrieval device 110 through the communication network 130. The Input output unit may also be an Input/output (I/O) interface of the retrieval device 110, if the user sends the query statement to the retrieval device 110 locally at the retrieval device 110.
FIG. 2 is a schematic diagram of another architecture of the information retrieval system 100. Information retrieval system 100 includes one or more retrieval devices 110 and also includes one or more storage devices 120. Communication between each retrieval device 110 and each storage device 120 is accomplished through a communication network 130. The repository of documents, the repository of index documents, the historical query statements, the historical query logs, the repository of word features, etc. of the information retrieval system 100 may be distributed among the storage devices 120. One or more retrieval devices 110 may constitute a distributed computing system that processes query statements. When the number of query statements to be processed is large, that is, the load of the information retrieval system 100 is high, the information retrieval system 100 can allocate the tasks to be processed to different retrieval devices 110 for execution, so as to improve the parallel processing capability of the information retrieval system 100.
Fig. 3 is a schematic structural diagram of the retrieval device 110 according to an embodiment of the present invention, where the retrieval device 110 includes:
and an input and output unit 111, configured to receive the query statement input by the user and send the query statement to the query statement processing unit 112, and further configured to output a query result returned to the result processing unit 115 to the user.
The query sentence processing unit 112 is configured to parse the query sentence, obtain a keyword in the query sentence, and send the keyword to the fuzzy search processing unit 113.
Specifically, the processing of the query statement by the query statement processing unit 112 includes: word segmentation processing, stop word filtering processing or conversion processing and the like. Specifically, the conversion processing includes conversion processing such as synonym conversion, temporal conversion (mainly used for english, for example, converting sitting into sit), and the like.
A fuzzy search processing unit 113 for identifying fuzzy keywords from the keywords transmitted from the query sentence processing unit 112, transmitting a part of the fuzzy keywords to the first indexing unit 114, and transmitting a complete keyword from the keywords transmitted from the query sentence processing unit 112 to the second indexing unit 115.
Specifically, the keywords sent by the query sentence processing unit 112 are CD and AB, and the fuzzy search processing unit 113 sends part of the keywords AB in the fuzzy keywords AB to the first indexing unit 114 and sends the complete keywords CD to the second indexing unit 115.
The first indexing unit 114 includes a first index, and the first index includes a correspondence relationship between partial keywords and full keywords. The first indexing unit 114 is configured to determine a complete keyword where the partial keyword sent by the fuzzy search processing unit 113 is located based on the first index, and return the complete keyword to the fuzzy search processing unit 113.
Specifically, the Key Value (Key) portion of the first index includes a partial Key, and the attribute Value (Value) portion of the first index includes a full Key.
The fuzzy search processing unit 113 is further configured to send the complete keyword to the second indexing unit 114.
The second indexing unit 115 includes a second index, where the second index includes a correspondence between the complete keyword and the document where the complete keyword is located. The second indexing unit 115 is configured to determine, based on the second index, a document in which the complete keyword sent from the fuzzy search processing unit 113 is located, and send information indicating the document to the returned result processing unit 116.
Specifically, the second indexing unit 115 is configured to search for a document matched with the complete keyword, calculate a relevance score between the document and the query statement, score the document, and return the document with the score top N to the returned result processing module 116.
It is to be understood that the retrieval device 110 may include one or more second indexing units 115.
And the return result processing unit 116 is configured to process the information sent by the second indexing unit 115, including sorting, typesetting, and the like.
In the case that there are a plurality of second indexing units 115, the returned result processing unit 116 performs global ranking after receiving all the documents returned by the second indexing units 115. After the globally arranged document is obtained, according to the requirement of final display, the information related to the document is taken and spliced into a final information complete set to be displayed; and fine-tuning is performed according to the final presentation requirement, for example, the number of documents from the same source cannot exceed three. And finally, returning the ordered document result to the user through the input and output unit 111.
Fig. 4 is a schematic flow chart of a method 200 for information retrieval according to an embodiment of the present invention, where the method 200 may be executed by the retrieval device 110 shown in fig. 1, fig. 2, or fig. 3, for example, and the method 200 includes:
a query statement is received 210, the query statement including fuzzy keywords, the fuzzy keywords being words containing the matchers.
Specifically, a query sentence of a User is input to a search engine background through a User Interface (UI), and first, a word segmentation process is performed, for example, a word segmenter is used to perform word segmentation on the query sentence, so that a list including words and word frequencies is generated. And filtering stop words, namely stop words, which refer to words in the sentence which do not directly influence or slightly influence the expression of the sentence, for example, words which are not helpful for searching out related files in the query sentence input by the user. After word segmentation processing and stop word filtering processing are carried out on the query sentence, the keywords in the query sentence are obtained, and then the fuzzy keywords are determined. For example, if the keywords in the obtained query sentence are AB, CD, and EF, the fuzzy keyword is EF.
220, determining a first part of keywords according to the fuzzy keywords, wherein the first part of keywords are parts of the fuzzy keywords except for the matching characters.
And 230, acquiring a first complete keyword according to the first index and the first part of keywords, wherein the first complete keyword comprises the first part of keywords, the first index comprises a key value part and an attribute value part, the information stored in the key value part has a corresponding relation with the information stored in the attribute value part, the first part of keywords are stored in the key value part, and the first complete keyword is stored in the attribute value part.
It should be understood that one or more complete keywords including the first partial keyword may be obtained according to the first index and the first partial keyword, and for convenience of understanding and description, the embodiment of the present invention is described by taking the first complete keyword as an example, in other words, the first complete keyword represents any one of a plurality of complete keywords including the first partial keyword obtained according to the first index and the first partial keyword.
The first index is an index built for a portion of the keywords. Based on the first index, the complete keywords corresponding to part of the keywords can be quickly acquired. Specifically, the specific form of the first index is shown in table 1:
TABLE 1
Key value (Key) | Attribute Value (Value) |
Part of the keywords 1 | Complete keywords 1a,1b, … |
Part of the keywords 2 | Complete keywords 2a,2b, … |
As can be seen from table 1, the first index includes a key value portion and an attribute value portion, the key value portion stores a partial key, and the attribute value portion stores a complete key. Starting from the second row of table 1, the partial keywords of each row have a correspondence with the full keywords of the present row. For example, partial keyword 1 corresponds to full keywords 1a,1b, …; partial keyword 2 corresponds to full keyword 2a,2b, ….
It should be understood that the complete keyword corresponding to one partial keyword may include one or more, and it is not convenient to enumerate in table 1, and only two complete keywords are schematically shown, but do not limit the scope of the present invention.
It should also be understood that table 1 is only an example and not a limitation, and the specific form of the first index may also be other forms besides a table, such as a database, a file, xml, json or a self-defined data structure, etc., which is not limited by the embodiment of the present invention.
And 240, acquiring a first document according to the second index and the first complete keyword, wherein the first document is the document where the first complete keyword is located, and the second index comprises the corresponding relation between the first complete keyword and the first document.
Specifically, the second index also includes a key value portion and an attribute value portion, the key value portion of the second index stores the complete keyword, and the attribute value portion of the second index stores information indicating a document where the complete keyword is located. The second index may correspond to an index file in an existing information retrieval scheme, and specific content is in the prior art and is not described herein again.
In the embodiment of the invention, the first index comprises the corresponding relation between the partial keywords and the complete keywords, and the complete keywords corresponding to the partial keywords can be found more quickly through the first index, so that the embodiment of the invention can improve the overall efficiency of information retrieval in a fuzzy search scene.
It should be appreciated that embodiments of the present invention generate the first index already before the first index is utilized to obtain the first full key at 230. Specifically, it may be system pre-generation. It should also be understood that the first index may be updated or modified periodically, which is not limited by the embodiments of the present invention.
Optionally, as shown in fig. 5, as an embodiment, in the embodiment shown in fig. 4, the method 200 further includes:
and 250, acquiring a complete keyword set according to the corpus data set.
Specifically, the corpus data set includes a document data set and a query sentence set, wherein the document data set includes documents searched by a user, and the query sentence set includes query sentences searched by the user in historical time. And performing word segmentation processing and stop word filtering processing on the document data set and the query sentence set respectively to obtain corresponding word and word frequency lists, and then obtaining the complete keyword set based on the two lists.
And 260, performing n-tuple segmentation on a second complete keyword in the complete keyword set, and determining the word after the n-tuple segmentation as a second part of keywords, wherein n is a positive integer less than or equal to the character length of the second complete keyword.
For example, 2-tuple segmentation is performed on the complete keyword ABCD to obtain partial keywords AB, BC and CD. For another example, 3-tuple segmentation is performed on the complete keyword ABCD to obtain partial keywords ABC and BCD.
It should be understood that the second full keyword in this embodiment indicates each full keyword in the full keyword set.
And 270, determining a third complete keyword, wherein the third complete keyword is a word in the complete keyword set, and the third complete keyword comprises the second part of keywords.
For example, the complete keyword set includes the following complete keywords ABCD, ABD, ABE, and BCD. 2-tuple segmentation is performed on the ABCD to obtain a partial keyword AB, and a third complete keyword corresponding to the partial keyword AB comprises: ABCD, ABD and ABE.
280, generating a first index according to the second part of keywords and the third complete keywords, wherein the key value part of the first index comprises the second part of keywords, and the attribute value part of the first index comprises the third complete keywords. Specifically, the first index is as shown in table 1.
As already mentioned above, there are matching types between partial keywords and complete keywords, and the same partial keyword corresponds to different complete keywords for different matching types. In order to match the partial keywords and the complete keywords in a finer granularity manner, the embodiment of the invention further provides that the matching types of the partial keywords are stored in the attribute value part of the first index, so that the first index comprises the corresponding relation among the partial keywords, the matching types and the complete keywords.
Optionally, as an embodiment, determining 270 a third full keyword includes:
and determining a third complete keyword based on the matching type, wherein the third complete keyword is matched with the second part of keywords based on the matching type, and the matching type is any one of front word matching, rear word matching or front and rear word matching.
Wherein the generating 280 the first index comprises: and generating a first index according to the second part of keywords, the matching type and the third complete keywords, wherein the attribute value part of the first index also comprises the matching type.
Specifically, the first index generated in the embodiment of the present invention is shown in table 2:
TABLE 2
As can be seen from table 2, the first index includes a key value portion and an attribute value portion, the key value portion stores a partial keyword, and the attribute value portion stores a matching type and a complete keyword, that is, the first index indicates a correspondence between the partial keyword, the matching type, and the complete keyword. For example, partial keyword 1, previous word match, and complete keyword 11a,11b, … have a correspondence relationship therebetween; the partial keyword 2 and the postword match have a corresponding relationship with the complete keywords 22a,22b and …; there is a correspondence between the partial keyword 2, the pre-and post-word match and the full keyword 23a,23b, ….
It should be understood that the complete keyword corresponding to one partial keyword may include one or more, and it is not convenient to enumerate in table 2, and only two complete keywords are schematically shown, but do not limit the scope of the present invention.
It should also be understood that table 2 is only an example and not a limitation, and the specific form of the first index may also be other forms besides a table, such as a database, a file, xml, json or a self-defined data structure, etc., which is not limited by the embodiment of the present invention.
In the embodiment of the invention, under the condition that the matching types of the partial keywords and the partial keywords are known, the corresponding complete keywords can be obtained according to the first index.
Wherein, the step 230 of obtaining a first complete keyword according to the first index and the first partial keyword comprises:
and acquiring a first complete keyword according to the first index, the first part of keywords and the matching type of the first part of keywords, wherein the matching type of the first part of keywords is determined according to the position relation between the first part of keywords and the matching character.
For example, the first part of keywords are AB in the fuzzy keywords AB, and the matching type of the first part of keywords is the post-word matching. For another example, if the first part of keywords are CD in the fuzzy keyword CD, the matching type of the first part of keywords is the prefix matching. For another example, the first part of keywords is BC in the fuzzy keywords BC, and the matching type of the first part of keywords is front and back word matching.
Optionally, as an embodiment, in the embodiment shown in fig. 5, the method 200 further includes: and determining the correlation parameters of the second part of keywords and the third complete keywords according to the occurrence frequency of the second part of keywords in the complete keyword set and the occurrence frequency of the third complete keywords in the complete keyword set.
Optionally, as an embodiment, the second part of keywords w is calculated according to the following formulapCorrelation parameter r (w) with third complete keyword wp,w):
r(wp,w)=α*f(wp,w)*sN(wp,w)+β*q(wp,w) (1)
Wherein, f (w)pW) represents wpConditional probability of w, sN(wpW) represents wpTightness parameter with w, q (w)pW) represents wpThe user feedback weights, α and β, with w being constant, may be preconfigured.
Wherein the generating 280 the first index comprises: and generating a first index according to the second part of keywords, the matching type, the third complete keywords and the correlation parameters, wherein the attribute value part of the first index also comprises the correlation parameters of the second part of keywords and the third complete keywords.
Specifically, the first index generated by the embodiment of the present invention is shown in table 3:
TABLE 3
As can be seen from table 3, the first index includes a key value portion and an attribute value portion, the key value portion stores a partial keyword, and the attribute value portion stores a matching type, a complete keyword, and a correlation parameter, i.e., the first index indicates a correspondence relationship between the partial keyword, the matching type, the complete keyword, and the correlation parameter. For example, the partial keyword 1, the previous word match, and the complete keyword 11a have a corresponding relationship, and the correlation parameter between the partial keyword 1 and the complete keyword 11a is r 1; the partial keyword 2, the previous and next word matching and the complete keyword 23a have a corresponding relationship, and the correlation parameter between the partial keyword 2 and the complete keyword 23a is r 11.
It should be understood that the complete keyword corresponding to one partial keyword may include one or more, and it is not convenient to enumerate in table 3, and only two complete keywords are schematically shown, but do not limit the scope of the present invention.
It should also be understood that table 3 is only an example and not a limitation, and the specific form of the first index may also be other forms besides a table, such as a database, a file, xml, json or a self-defined data structure, etc., which is not limited by the embodiment of the present invention.
It should be understood that in the embodiment shown in fig. 4, after the first document where the complete keyword is located is obtained, the score of the first document needs to be calculated, and the order of presenting the first document to the user is sequentially determined. In the prior art, a score for a first document is typically calculated based on relevance parameters of the first document to a query statement. The embodiment of the invention provides that when calculating the score of the first document, the relevance parameters between partial keywords and complete keywords are also considered, and the relevance of fuzzy search can be improved compared with the prior art.
Optionally, as an embodiment, in the embodiment shown in fig. 4 or fig. 5, the method 200 further includes:
acquiring correlation parameters of the first part of keywords and the first complete keywords according to the first index, the first part of keywords and the first complete keywords;
and determining the score of the first document according to the correlation parameters of the first partial keywords and the first complete keywords and the correlation parameters of the first document and the query statement.
In the embodiment of the invention, when the scoring of the document where the complete keyword is located is determined, the correlation parameters between the partial keyword and the complete keyword are considered, so that the correlation of the retrieval result of the information retrieval in the fuzzy search scene can be improved.
Optionally, as an embodiment, in the embodiment shown in fig. 5, the method 200 further includes:
acquiring the number of first documents, wherein the number of the first documents is the number of documents matched with the third complete keyword in the complete keyword set;
wherein, 280, generating the first index comprises: and generating a first index according to the second part of keywords, the matching type, the third complete keywords, the correlation parameters and the first document number, wherein the attribute value part of the first index also comprises the first document number.
Specifically, the first index in the present embodiment is represented in table 4:
TABLE 4
As can be seen from table 4, the first index includes a key value portion and an attribute value portion, the key value portion stores partial keywords, and the attribute value portion stores matching types, complete keywords, relevance parameters, and document numbers, that is, the first index indicates correspondence between the partial keywords, the matching types, the complete keywords, the relevance parameters, and the document numbers. For example, the partial keyword 1, the previous word match, and the complete keyword 11a have a corresponding relationship, the correlation parameter between the partial keyword 1 and the complete keyword 11a is r1, and the number of documents matched by the complete keyword 11a is 1; the partial keyword 2, the matching of the preceding and following words and the complete keyword 23a have a corresponding relationship, the correlation parameter between the partial keyword 2 and the complete keyword 23a is r11, and the number of documents matched by the complete keyword 23a is the number of documents 11.
It should be understood that the complete keyword corresponding to one partial keyword may include one or more, and it is not convenient to enumerate in table 4, and only two complete keywords are schematically shown, but the scope of the present invention is not limited thereto.
It should also be understood that table 4 is only an example and not a limitation, and the specific form of the first index may also be other forms besides a table, such as a database, a file, xml, json or a self-defined data structure, etc., which is not limited by the embodiment of the present invention.
Optionally, as an embodiment, in the embodiment shown in fig. 4 or fig. 5, the method 200 further includes:
acquiring the number of second documents corresponding to the first complete keyword according to the first index and the first complete keyword;
determining the score of the first document according to the relevance parameters of the first part of keywords and the first complete keywords and the relevance parameters of the first document and the query statement, wherein the determining the score of the first document comprises the following steps:
and determining the score of the first document according to the number of the second documents, the correlation parameters of the first partial keywords and the first complete keywords and the correlation parameters of the first document and the query statement.
In the embodiment of the invention, when determining the score of the document where the complete keyword is located, the relevance parameters between the partial keyword and the complete keyword are considered, and the number of documents matched with the complete keyword is also considered, so that the relevance of the retrieval result of the information retrieval in the fuzzy search scene can be improved.
In the embodiment of the invention, the first index comprises the corresponding relation between the partial keywords and the complete keywords, and the complete keywords corresponding to the partial keywords can be found more quickly through the first index, so that the embodiment of the invention can improve the overall efficiency of information retrieval in a fuzzy search scene.
It should be understood that the embodiments shown in fig. 4 or fig. 5 may be performed by the retrieval device 110 shown in fig. 3. Specifically, step 210 in the embodiment shown in fig. 4 is performed by the input/output unit 111 and the query statement processing unit 113, step 220 is performed by the query statement processing unit 113 and the fuzzy search processing unit 113, step 230 is performed by the fuzzy search processing unit 113 and the first indexing unit 114, step 240 is performed by the fuzzy search processing unit 113 and the second indexing unit 115, and step 250 and 280 are performed by the first indexing unit 114.
Optionally, as an embodiment, when executing step 220, the fuzzy search processing unit 113 further includes:
after receiving the keywords issued by the query sentence processing unit 113, the processing is differentiated according to whether the inside includes the fuzzy keyword.
Specifically, if the keywords sent by the query sentence processing unit 113 include fuzzy keywords, for example, the fuzzy keywords include front word fuzzy matching (e.g. AB), rear word fuzzy matching (e.g. AB), or front and rear word matching (e.g. AB), the fuzzy search processing unit 113 determines which type of fuzzy matching is the fuzzy keyword in the front word/rear word/front and rear words according to the position of the matching character.
For example, if the fuzzy keyword is AB, then the partial keyword is determined to be AB, and the matching type is pre-word matching. And if the fuzzy keyword is AB, determining that part of keywords are AB, and the matching type is the matching of the later words. And if the fuzzy keyword is AB, determining that part of keywords are AB, and the matching type is front and rear word matching.
The different types may determine the specific content sent by the fuzzy search processing unit 113 to the first indexing unit 114.
If the keywords issued by the query sentence processing unit 113 include the complete keyword, the complete keyword is directly sent to the second indexing unit 115 for processing.
Taking the keyword sent by the query processing unit 113 as CD AB, the fuzzy search processing unit 113 will send the keyword CD directly to the second indexing unit 115 for processing, and if the fuzzy keyword AB is determined as the backward word fuzzy match, send part of the keyword AB to the first indexing unit 114, and indicate that the keyword is the backward word match (i.e. only part of the Bf in the first index is used for querying).
The first indexing unit 114 finds a complete keyword corresponding to the partial keyword AB, a correlation parameter between the partial keyword and the complete keyword, and the number of documents matched by the complete keyword based on the first index (as shown in table 4) according to the input partial keyword (e.g., AB) and the matching type (e.g., post-word matching), and then returns these information to the fuzzy search processing unit 113.
Optionally, as an embodiment, when executing step 240, the fuzzy search processing unit 113 further includes:
after receiving all the complete keywords (e.g., ABC), the correlation parameters (e.g., r (AB, ABC)), and the number of matched documents (e.g., n (ABC)), the complete keywords and the correlation coefficients thereof that need to be considered in an important manner are filtered out by the following rules and sent to the second indexing unit 115:
sorting the whole word information according to the relevance of the whole words, calculating the document coverage rate of Top N words according to the number of the whole word matching documents and the calculation formula of the CoverRatio, and then determining which whole words need to be sent and a relevance coefficient according to the threshold value of the coverage rate;
sometimes, the number of documents corresponding to some whole words is more than x times (x is defined by itself, for example, 50 times) of the number of documents corresponding to other whole words, and in order to ensure the diversity of the matched whole words, we can make a truncation on the number of documents matched by the whole words, for example, the number of documents is an average number of documents, and then recalculate the document coverage rate of top N words for the first rule. An example is given below in terms of part AB, as shown in table 5:
TABLE 5
Since 1000 is 100 times 10 and 50 times 20, the average number of documents is taken
Instead of 1000, (1000+10+20)/3 ═ 343, recalculated in table 5 above to obtain the coverage values shown in table 6 for the rule inputs of the first step.
TABLE 6
When the coverage rate is 93%, ABC and ABD are sent to the second index unit 115 after the processing shown in table 6.
Optionally, in this embodiment of the present invention, the second complete keyword is a complete keyword in the complete keyword set, where the length of the character is greater than a threshold L, and n is a positive integer less than or equal to L.
For a detailed description, see the description of step 340 below.
Optionally, in this embodiment of the present invention, the obtaining 250 a complete keyword set according to the corpus data set includes:
and acquiring a complete keyword set according to the overall coverage rate of the corpus data set and the complete keywords expected to be covered, wherein the number of the complete keywords in the complete keyword set is less than that of the complete keywords in the corpus data set.
For a detailed description, see the description of step 330 below.
For better understanding of the embodiment of the present invention, the following describes a process of generating the first index in conjunction with fig. 6, as shown in fig. 6, including:
a document dataset and a query statement set are obtained 310.
A document dataset is a collection of all documents to be binned by a search engine for searching by a user. A set of query statements is a set of query statements used by a user on a search engine over a historical period of time.
And 320, performing word segmentation processing on the document data set and the query statement set.
Specifically, each document in the document data set is segmented by a segmenter, and each document generates a list of words and word frequencies.
For example, a document is "green grape, purple grape, green grape has no purple grape, grape has no grape skin spitting backward", and the word and word frequency list generated by using the word segmenter to segment the document is shown in table 7:
TABLE 7
The process of performing word segmentation processing on the query sentence set is similar to the process of performing word segmentation processing on the document data set, and is not described herein again.
330, determining a complete keyword set. Step 330 may also be referred to as accent word mining.
Since the difference between the document data set and the query statement set is large, the two data sets are processed separately.
1) And processing the document data set.
According to the word and word frequency list (hereinafter referred to as the first word frequency table) of the document data set obtained by the word processing in step 320, the importance index of each word in the first word frequency table is calculated by using a TF-IDF method or other feasible methods, and then, the words in the first word frequency table are sorted in a descending order according to the importance index, and for convenience of description, the first word frequency table sorted in the descending order is referred to as a second word frequency table. Note that TF means Term Frequency (Term Frequency), and IDF means Inverse Document Frequency (Inverse Document Frequency). In a given document, term frequency refers to the number of times a given term appears in the document. The word frequency is typically normalized to prevent it from being biased towards long documents. Reverse file frequency is a measure of the general importance of a word. For example, the inverse file frequency for a particular term may be determined by dividing the total number of documents by the number of documents that contain the term, and taking the logarithm of the resulting quotient. The TF-IDF method is used for filtering out common words and keeping important words. The TF-IDF method is prior art and the details are not described in detail.
The words of the top N rank in the second word frequency table can be obtained according to the overall coverage rate (CoverRatio)1 of the key words to be covered by the fuzzy search, and used for key word segmentation.
Specifically, the following formula is used for determining that the top-ranked N words in the second word frequency table are used for key word segmentation:
CoverRatio1=(x1+x2+…+xN)/(x1+x2+…+xtotalNum1) (2)
where CoverRatio1 denotes overall coverage, xiIndicates the word frequency of the word ranked at the ith position in the second word frequency table, i.e., the number of times the word at the ith position appears in the document data set, and totalNUM1 is the total number of different words in the document data set.
It should be understood that the overall coverage is a percentage value pre-configured by the system, and in practical applications, the overall coverage can be determined according to the number of key words to be covered.
For convenience of the following description, a table formed by the top N words in the second word frequency table and the corresponding word frequencies is referred to as an emphasized word list 1.
2) And processing the query statement set.
A list of words and word frequencies (hereinafter referred to as a third word frequency table) of the query sentence set obtained by the word processing in step 320. And sorting the words in a descending order according to the word frequency of each word in the third word frequency table, and for convenience of description below, recording the third word frequency table which is sorted in the descending order as a fourth word frequency table.
The method can be determined according to the overall coverage rate (CoverRatio)2 of the key words to be covered by the fuzzy search, and the word of the top M in the fourth word frequency table is taken for splitting the key words.
Specifically, determining to use a word M before ranking in the fourth word frequency table for key word segmentation by using the following formula:
CoverRatio2=(y1+y2+…+yM)/(y1+y2+…+ytotalNum2) (3)
where CoverRatio2 denotes overall coverage, yiRepresenting the word frequency of the j-th word in the fourth word frequency table, namely the frequency of the i-th word in the query sentence set; totalNUM2 is the total number of different words in the query statement set.
For convenience of the following description, a table formed by the M-top ranked words and the corresponding word frequencies in the fourth word frequency table is referred to as a highlight list 2.
It should be understood that the overall coverage (CoverRatio)1 and the overall coverage (CoverRatio)2 may be the same or different, and the embodiment of the present invention is not limited thereto.
Combining the key word list 1 and the key word list 2 to obtain a complete keyword set, wherein the specific combination mode is that the same word is combined into a line, the overall word frequency of the word is the word frequency of the word in the key word list 1 plus α times of the word frequency of the word in the key word list 2, α is a configurable parameter, and α is used for representing the relative importance of the document data set and the query sentence set.
Hereinafter, each word included in the complete keyword set is referred to as a complete keyword.
Specifically, the complete keyword set is shown in table 8:
TABLE 8
Complete key word | Word frequency of complete keywords |
ABCD | 20 |
ABCE | 30 |
ABCER | 50 |
340, carrying out full segmentation on the complete keyword set to obtain partial keywords.
And further subdividing the complete keywords of which the character lengths are larger than a threshold value L in the complete keyword set. L is usually 3.
For example, a complete keyword is regarded as a character string, all n-tuples of the character string are taken (three scenes of AB, AB), an n-tuple beginning with the first character of the character string may be taken (only a scene of AB may be supported at this time), an n-tuple ending with the last character of the character string may be taken (only a scene of AB may be supported at this time), and n is a positive integer smaller than L.
For example, 3-tuple segmentation is performed on the complete keyword ABCD to obtain partial keywords: ABC and BCD. It follows that a first index with these partial keywords as Key values (keys) is generated to support fuzzy search.
In order to calculate the correlation parameter between the partial keyword and the complete keyword in the following step 350, the complete keyword needs to be fully segmented in this step, for example, the fully segmented n-tuple is obtained after fully segmenting the complete keyword ABCD: a, B, C, D, AB, BC, CD, ABC, BCD, ABCD. The frequency of the fully-segmented n-tuple is counted, and only the word frequency of the complete keyword where the n-tuple is located needs to be added during counting, for example, if the 3-tuple ABC is respectively located in the complete keyword ABCD (word frequency is 20), the complete keyword ABCE (word frequency is 30), and the complete keyword ABCER (word frequency is 50), the word frequency of the 3-tuple ABC is 20+30+50 — 100.
And a table obtained after full segmentation is carried out on the complete keywords with the character length larger than the threshold value L in the complete keyword set is called a full segmentation information table, and the full segmentation information table comprises n-tuple and word frequency. It should be understood that in the embodiment of the present invention, the n-tuple and the partial keyword are equivalent concepts.
Specifically, the complete keyword set is exemplified as table 8, and a part of the full-segmentation information table after performing full segmentation on each complete keyword is shown in table 9:
TABLE 9
Partial key word (n-tuple) | Word frequency of partial keywords |
AB | 100 |
BC | 100 |
CD | 20 |
350, calculating a correlation parameter between the partial keyword and the complete keyword.
First, using the complete keyword set (as shown in table 8) and the complete segmentation information table (as shown in table 9) obtained in the above steps, the probability p (w) of each complete keyword w in the complete keyword set and the probability w of each partial keyword w in the complete keyword set can be calculatedpProbability of occurrence p (w) in the complete set of keywordsp)。
Specifically, p (w) is equal to the word frequency of the complete keyword w divided by the sum of all the word frequencies in the complete keyword set (as shown in table 8) and the full segmentation information table (as shown in table 9); p (w)p) Is equal to part of the keyword wpDivided by the sum of all the word frequencies in the complete keyword set (as shown in table 8) and the full segmentation information table (as shown in table 9).
Then, several indices are calculated as follows:
1) conditional probability f (w)p,w)
f(wp,w)=p(w)/p(wp) (4)
Conditional probability f (w)pW) characterizes the complete keyword w relative to the partial keywords wpThe importance of all the complete keywords is an index for looking at the overall importance.
2) Compactness s (w)p,w)
s(wp,w)=p(w)/p(wbp)p(wp)p(wap) (5)
Wherein, p (w)bp) And p (w)ap) Respectively representing partial keywords w in the complete keywords wpThe probability of occurrence of the word segments of the preceding and following parts.
Compactness s (w)pW) characterizing a portion of the keyword wpImportance within the complete keyword w.
Due to s (w)pW) relative to wpInstead of a normalized index, the pair s (w) is requiredpW) normalization is performed as follows:
wherein, wiIndicating that the complete keyword set contains partial keywords wpN1 is that the complete keyword set contains partial keywords wpThe number of complete keywords.
3) User feedback weight q (w)p,w)
User feedback weight q (w)pW) for characterizing a user input including a partial keyword wpThe probability of the complete keyword w that the user will be finally found under the fuzzy query condition of (1).
Specifically, q (w) is calculated using the search logpW). The search log comprises query sentences of the user, query sentence analysis information, detail information in each module which is issued by the analyzed query sentences to the rear end of the search engine for processing, finally returned document information, information of the search results clicked by the user and the like.
Generally, when a user clicks a certain document and then does not click any other document, the certain document is considered as a document desired by the user; when the user clicks a certain document for s seconds (s is generally 30) and then clicks other documents, the certain document is considered to be the document desired by the user. By utilizing the two rules, log information containing a document which a user wants to find can be filtered from all the fuzzy search logs, so that mapping information of the fuzzy query statement and the document can be obtained.
For example, when the fuzzy query statement is AB, the documents Doc1 and Doc2 are intended by the user. It should be understood that in this fuzzy query, AB is a partial keyword wp。
Using the complete keyword w beginning with AB in Doc11Importance indicators in Doc1 to define user feedback weightsA value of (d); using the complete keyword w beginning with AB in Doc21Importance index in Doc2User feedback weightsIf there are also complete keywords w starting with AB in Doc1 and Doc22Obtained by the same wayAndthe value of (c).
By the method, the corresponding user feedback indexes can be mined from the logs of each fuzzy search. Finally, the user feedback indexes obtained by mining all the fuzzy search logs are integrated to obtain the final user feedback index q (w)p,w)。
Wherein x traverses all the subsumption wpY1 traverse all document indices, y, containing w1xTraverse all containers wxDocument subscript of cy1Indicating that there is Doc in all wp fuzzy queriesy1Number of times of document satisfaction, cyxIndicating that there is Doc in all wp fuzzy queriesyxTo satisfy the number of documents. Q (w) can be seen by definitionpW) is relative to wpThe normalization index of (1).
Finally, calculating partial keywords w according to the three indexespCorrelation parameter r (w) with complete keyword wpW), and r (w) can be specifically calculated by either one of the following two formulasp,w):
r(wp,w)=α·f(wp,w)·s(wp,w)+β·q(wp,w) (1)
r(wp,w)=α·f2(wp,w)·s(wp,w)+β·q(wp,w) (8)
The above formula is by way of example only and not by way of limitation, e.g., f (w)pW) can also be 3, where the coefficients α and β are configurable.
360, a first index is generated.
And establishing a first index by taking the partial keywords (n-tuples) as Key values (keys), wherein the attribute values of the first index comprise the matching types of the partial keywords, the complete keywords where the partial keywords are located, correlation parameters between the partial keywords and the complete keywords, and the number of documents where the complete keywords are located. Specifically, as shown in table 4.
Specifically, the first index is generated according to the following information:
wp=f:w,r(wp,w),n(w) (9)
wherein, wpRepresenting a second partial keyword, w representing a third full keyword, f representing a second partial keyword wpOf the matching type, r (w)pW) represents the second partial keyword wpA relevance parameter with respect to the third full keyword w, n (w) indicates the number of documents in the full keyword set that the third full keyword w matches, where the left part indicates the key value part of the first index, and the right part indicates the attribute value part of the first index.
Wherein the expansion of equation (9) may be:
wherein Bf represents that the matching type of partial keywords is the matching of the previous words,representing part of a keyword wpThe complete keyword matched by the front word, Af represents that the matching type of partial keywords is the matching of the rear word,representing part of a keyword wpThe complete keyword to which the back word is matched, ABf indicates that the match type of the partial keyword is a back word match,representing part of a keyword wpThe complete keyword matched by the later word.
It should be understood that,comprises thatAndand also includespA complete word that strictly matches before and after words, for example, a complete keyword that matches before and after words of the partial keyword BC includes: ABC, BCD and ABCD.
It should also be understood that the above formula for expressing the first index is only an example and is not limited, that is, only an expression, and when the expression is specifically implemented, the expression may be implemented by xml, json, or a data structure defined by itself, which is not limited in this embodiment of the present invention.
Fig. 7 is a schematic block diagram of an apparatus 600 for information retrieval according to an embodiment of the present invention, the apparatus 600 including:
an input module 610, configured to receive a query statement, where the query statement includes a fuzzy keyword, and the fuzzy keyword is a word including a matching character;
the processing module 620 is configured to determine a first part of keywords according to the fuzzy keywords, where the first part of keywords are parts of the fuzzy keywords except for the matching characters;
the processing module 620 is further configured to obtain a first complete keyword according to the first index and the first part of keywords, where the first complete keyword includes the first part of keywords, the first index includes a key value part and an attribute value part, where information stored in the key value part has a corresponding relationship with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword;
the processing module 620 is further configured to obtain a first document according to the second index and the first complete keyword, where the first document is a document where the first complete keyword is located, and the second index includes a corresponding relationship between the first complete keyword and the first document.
In the embodiment of the invention, the first index comprises the corresponding relation between the partial keywords and the complete keywords, and the complete keywords corresponding to the partial keywords can be found more quickly through the first index, so that the embodiment of the invention can improve the overall efficiency of information retrieval in a fuzzy search scene.
Optionally, as an embodiment, the processing module 620 is further configured to obtain a complete keyword set according to the corpus data set; performing n-tuple segmentation on a second complete keyword in the complete keyword set, determining words after the n-tuple segmentation as second partial keywords, wherein n is a positive integer less than or equal to the character length of the second complete keyword; determining a third complete keyword, wherein the third complete keyword is a word containing the second part of keywords in the complete keyword set; and generating a first index according to the second part of keywords and the third complete keywords, wherein the key value part of the first index comprises the second part of keywords, and the attribute value part of the first index comprises the third complete keywords. Specifically, the first index is as shown in table 1.
Optionally, as an embodiment, the processing module 620 is specifically configured to determine a third complete keyword based on the matching type, where the third complete keyword is matched with the second partial keyword based on the matching type, and the matching type is any one of a preceding word matching, a following word matching, or a preceding and following word matching;
the processing module 620 is specifically configured to generate a first index according to the second part of keywords, the matching type, and the third complete keyword, where an attribute value part of the first index further includes the matching type;
the processing module 620 is specifically configured to obtain a first complete keyword according to the first index, the first partial keyword, and the matching type of the first partial keyword, where the matching type of the first partial keyword is determined according to a position relationship between the first partial keyword and the matching symbol.
Specifically, the first index is as shown in table 2.
Optionally, as an embodiment, the processing module 620 is further configured to determine a correlation parameter between the second partial keyword and the third complete keyword according to an occurrence frequency of the second partial keyword in the complete keyword set and an occurrence frequency of the third complete keyword in the complete keyword set;
the processing module 620 is specifically configured to generate a first index according to the second part of keywords, the matching type, and the third complete keywords and the correlation parameters, where an attribute value part of the first index further includes the correlation parameters of the second part of keywords and the third complete keywords;
the processing module 620 is further configured to obtain a correlation parameter between the first partial keyword and the first complete keyword according to the first index;
the processing module 620 is further configured to determine a score for the first document according to the relevance parameter of the first partial keyword and the first complete keyword, and the relevance parameter of the first document and the query statement.
Specifically, the first index is as shown in table 3.
Optionally, as an embodiment, the processing module 620 is specifically configured to calculate the second partial keyword w according to the following formulapCorrelation parameter r (w) with third complete keyword wp,w):
r(wp,w)=α·f(wp,w)·s(wp,w)+β·q(wp,w)
Wherein, f (w)pW) represents wpConditional probability with w, s (w)pW) represents wpTightness parameter with w, q (w)pW) represents wpAnd w, α and β are constants.
Optionally, as an embodiment, the processing module 620 is further configured to obtain a first number of documents, where the first number of documents is a number of documents matched by the third complete keyword in the complete keyword set;
the processing module 620 is specifically configured to generate a first index according to the second part of keywords, the matching type, the third complete keywords, the correlation parameter, and the first document number, where an attribute value part of the first index further includes the first document number;
the processing module 620 is further configured to obtain, according to the first index, a number of second documents corresponding to the first complete keyword;
the processing module 620 is specifically configured to determine the score of the first document according to the number of the second documents, the correlation parameter between the first partial keyword and the first complete keyword, and the correlation parameter between the first document and the query statement.
Specifically, the first index is as shown in table 4.
Optionally, as an embodiment, the processing module 620 is specifically configured to generate the first index according to the following information:
wp=f:w,r(wp,w),n(w)
wherein, wpRepresenting a second partial keyword, w representing a third full keyword, f representing a second partial keyword wpOf the matching type, r (w)pW) represents the second partial keyword wpA relevance parameter with respect to the third full keyword w, n (w) indicates the number of documents in the full keyword set that the third full keyword w matches, where the left part indicates the key value part of the first index, and the right part indicates the attribute value part of the first index.
Optionally, as an embodiment, the second complete keyword is a complete keyword in the complete keyword set, where the length of the character is greater than the threshold L.
Optionally, as an embodiment, the processing module 620 is specifically configured to obtain a complete keyword set according to an overall coverage rate of the corpus data set and a complete keyword expected to be covered, where a number of the complete keyword included in the complete keyword set is smaller than a number of the complete keyword included in the corpus data set.
Specifically, the input module 610 in the embodiment of the present invention may be implemented by a receiver or a receiver-related circuit, such as the transceiver 740 shown in fig. 8; the processing module 620 may be implemented by a processor or processor-related circuitry, such as the processor 710 shown in FIG. 8.
It should be understood that the apparatus 600 for information retrieval provided by the embodiment of the present invention may refer to a computer program product, which may be a software installation package, and when the software installation package is executed by a computer, the method for information retrieval provided by the embodiment of the present invention can be executed. In other words, the information retrieval apparatus 600 provided by the embodiment of the present invention may refer to a software module, and correspondingly, the input module 610 and the processing module included in the information retrieval apparatus 600 may refer to software sub-modules.
It should also be understood that the apparatus 600 for information retrieval provided in the embodiment of the present invention may be used to execute the method for information retrieval provided in the embodiment of the present invention, and the above and other operations and/or functions of each module in the apparatus 600 are respectively for implementing corresponding flows of each method in fig. 3 to fig. 5, and are not described herein again for brevity.
Fig. 8 is a schematic block diagram of a computing device 700 according to an embodiment of the invention, the computing device 700 comprising a processor 710, a memory 720, a bus system 730 and a transceiver 740, wherein the processor 710, the memory 720 and the transceiver 740 are connected via the bus system 730. Memory 720 is used to store programs (or instructions) and processor 710 is used to execute instructions (or programs) stored in memory 720 to control transceiver 740 to receive signals and/or transmit signals. Specifically, processor 710 reads instructions in memory 720 for: a control transceiver 740 receiving a query sentence, the query sentence including a fuzzy keyword, the fuzzy keyword being a word including a matching character; the processor 710 is further configured to determine a first part of keywords according to the fuzzy keywords, where the first part of keywords are parts of the fuzzy keywords except for the matching character; acquiring a first complete keyword according to a first index and a first part of keywords, wherein the first complete keyword comprises the first part of keywords, the first index comprises a key value part and an attribute value part, information stored in the key value part has a corresponding relation with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword; and acquiring a first document according to the second index and the first complete keyword, wherein the first document is the document where the first complete keyword is located, and the second index comprises the corresponding relation between the first complete keyword and the first document.
In the embodiment of the invention, the first index comprises the corresponding relation between the partial keywords and the complete keywords, and the complete keywords corresponding to the partial keywords can be found more quickly through the first index, so that the embodiment of the invention can improve the overall efficiency of information retrieval in a fuzzy search scene.
Optionally, as an embodiment, the processor 710 is further configured to obtain a complete keyword set according to the corpus data set; performing n-tuple segmentation on a second complete keyword in the complete keyword set, determining words after the n-tuple segmentation as second partial keywords, wherein n is a positive integer less than or equal to the character length of the second complete keyword; determining a third complete keyword, wherein the third complete keyword is a word containing the second part of keywords in the complete keyword set; and generating a first index according to the second part of keywords and the third complete keywords, wherein the key value part of the first index comprises the second part of keywords, and the attribute value part of the first index comprises the third complete keywords.
Specifically, the first index is as shown in table 1.
Optionally, as an embodiment, the processor 710 is specifically configured to determine a third complete keyword based on the matching type, where the third complete keyword is matched with the second partial keyword based on the matching type, and the matching type is any one of a preceding word matching, a following word matching, or a preceding and following word matching;
the processor 710 is specifically configured to generate a first index according to the second partial keyword, the matching type, and the third complete keyword, where an attribute value part of the first index further includes the matching type;
the processor 710 is specifically configured to obtain a first complete keyword according to the first index, the first partial keyword, and the matching type of the first partial keyword, where the matching type of the first partial keyword is determined according to a position relationship between the first partial keyword and the matching symbol.
Specifically, the first index is as shown in table 2.
Optionally, as an embodiment, the processor 710 is further configured to determine a correlation parameter between the second partial keyword and the third complete keyword according to an occurrence frequency of the second partial keyword in the complete keyword set and an occurrence frequency of the third complete keyword in the complete keyword set;
the processor 710 is specifically configured to generate a first index according to the second part of keywords, the matching type, and the third complete keywords and the correlation parameters, where an attribute value part of the first index further includes the correlation parameters of the second part of keywords and the third complete keywords;
the processor 710 is further configured to obtain a correlation parameter between the first partial keyword and the first complete keyword according to the first index;
the processor 710 is further configured to determine a score for the first document based on the relevance parameter of the first partial keyword and the first full keyword, and the relevance parameter of the first document and the query statement.
Specifically, the first index is as shown in table 3.
Optionally, as an embodiment, the processor 710 is specifically configured to calculate the second partial keyword w according to the following formulapCorrelation parameter r (w) with third complete keyword wp,w):
r(wp,w)=α·f(wp,w)·s(wp,w)+β·q(wp,w)
Wherein, f (w)pW) represents wpConditional probability with w, s (w)pW) represents wpTightness parameter with w, q (w)pW) represents wpAnd w, α and β are constants.
Optionally, as an embodiment, the processor 710 is further configured to obtain a first number of documents, where the first number of documents is a number of documents matched by the third complete keyword in the complete keyword set;
the processor 710 is specifically configured to generate a first index according to the second partial keyword, the matching type, the third complete keyword, the correlation parameter, and the first document number, where an attribute value part of the first index further includes the first document number;
the processor 710 is further configured to obtain, according to the first index, a second number of documents corresponding to the first complete keyword;
the processor 710 is specifically configured to determine a score of the first document according to the number of the second documents, the relevance parameter of the first partial keyword and the first complete keyword, and the relevance parameter of the first document and the query sentence.
Specifically, the first index is as shown in table 4.
Optionally, as an embodiment, the processor 710 is specifically configured to generate the first index according to the following information:
wp=f:w,r(wp,w),n(w)
wherein, wpRepresenting a second partial keyword, w representing a third full keyword, f representing a second partial keyword wpOf the matching type, r (w)pW) represents the second partial keyword wpA relevance parameter with respect to the third full keyword w, n (w) indicates the number of documents in the full keyword set that the third full keyword w matches, where the left part indicates the key value part of the first index, and the right part indicates the attribute value part of the first index.
Optionally, as an embodiment, the second complete keyword is a complete keyword in the complete keyword set, where the length of the character is greater than or equal to the threshold L.
Optionally, as an embodiment, the processor 710 is specifically configured to obtain a complete keyword set according to an overall coverage rate of the corpus data set and a complete keyword expected to be covered, where a number of the complete keyword included in the complete keyword set is smaller than a number of the complete keyword included in the corpus data set.
It should be understood that, in the embodiment of the present invention, the Processor 710 may be a Central Processing Unit (CPU), and the Processor 710 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or any conventional processor or the like.
Memory 720 may include both read-only memory and random-access memory, and provides instructions (programs) and data to processor 710. A portion of memory 720 may also include non-volatile random access memory. For example, memory 720 may also store device type information.
The bus system 730 may include a power bus, a control bus, a status signal bus, and the like, in addition to the data bus. For clarity of illustration, however, the various buses are designated in the figure as the bus system 730.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 710. The steps of a method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in the memory 720, and the processor 710 reads the information in the memory 720 and performs the steps of the above method in combination with the hardware thereof. To avoid repetition, it is not described in detail here.
It should be understood that the computing device 700 provided in the embodiment of the present invention may be used to execute the method for information retrieval provided in the embodiment of the present invention, and may correspond to the apparatus 600 for information retrieval provided in the embodiment of the present invention, and the above and other operations and/or functions of each module in the computing device 700 are respectively for implementing corresponding flows of each method in fig. 3 to fig. 5, and are not described herein again for brevity.
It should also be understood that the various numerical designations referred to herein are merely for convenience in description and are not intended to limit the scope of embodiments of the invention.
It should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.
Claims (18)
1. A method of information retrieval, comprising:
receiving a query statement, wherein the query statement comprises fuzzy keywords which are words containing matching characters;
determining a first part of keywords according to the fuzzy keywords, wherein the first part of keywords are the parts of the fuzzy keywords except the matching characters;
acquiring a first complete keyword according to a first index and the first part of keywords, wherein the first complete keyword comprises the first part of keywords, the first index comprises a key value part and an attribute value part, information stored in the key value part has a corresponding relation with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword;
acquiring a first document according to a second index and the first complete keyword, wherein the first document is the document where the first complete keyword is located, and the second index comprises the corresponding relation between the first complete keyword and the first document;
acquiring a complete keyword set according to the corpus data set;
performing n-tuple segmentation on a second complete keyword in the complete keyword set, determining words after the n-tuple segmentation as second partial keywords, wherein n is a positive integer less than or equal to the character length of the second complete keyword;
determining a third complete keyword based on a matching type, wherein the third complete keyword is a word in the complete keyword set, the word comprises the second part of keywords, the third complete keyword is matched with the second part of keywords based on the matching type, and the matching type is any one of front word matching, rear word matching or front and rear word matching;
determining the correlation parameters of the second part of keywords and the third complete keywords according to the occurrence frequency of the second part of keywords in the complete keyword set and the occurrence frequency of the third complete keywords in the complete keyword set;
generating the first index according to the second part of keywords, the matching type, the third complete keywords and the correlation parameters, wherein key value parts of the first index comprise the second part of keywords, attribute value parts of the first index comprise the third complete keywords, and the matching type, the second part of keywords and the correlation parameters of the third complete keywords are matched;
the obtaining a first complete keyword according to the first index and the first partial keyword includes:
acquiring the first complete keyword according to the first index, the first part of keywords and the matching type of the first part of keywords, wherein the matching type of the first part of keywords is determined according to the position relation between the first part of keywords and the matching character;
acquiring correlation parameters of the first part of keywords and the first complete keywords according to the first index, the first part of keywords and the first complete keywords;
and determining the score of the first document according to the correlation parameter of the first part of keywords and the first complete keywords and the correlation parameter of the first document and the query statement.
2. The method of claim 1, wherein determining the relevance parameter of the second partial keyword and the third full keyword comprises:
calculating the second part of keywords w according to the following formulapA correlation parameter r (w) with the third complete keyword wp,w):
r(wp,w)=α·f(wp,w)·s(wp,w)+β·q(wp,w)
Wherein, f (w)pW) represents wpConditional probability with w, s (w)pW) represents wpTightness parameter with w, q (w)pW) represents wpAnd w, α and β are constants.
3. The method of claim 1, further comprising:
acquiring the number of first documents, wherein the number of the first documents is the number of documents matched with the third complete keyword in the complete keyword set;
wherein the generating the first index comprises:
generating the first index according to the second part of keywords, the matching type, the third complete keywords, the correlation parameters and the first document number, wherein the attribute value part of the first index also comprises the first document number;
the method further comprises the following steps:
acquiring the number of second documents corresponding to the first complete keyword according to the first index and the first complete keyword;
determining a score of the first document according to the relevance parameter of the first partial keyword and the first complete keyword and the relevance parameter of the first document and the query statement, wherein the determining the score of the first document comprises:
and determining the score of the first document according to the number of the second documents, the correlation parameter of the first partial key words and the first complete key words and the correlation parameter of the first document and the query statement.
4. The method of claim 3, wherein the generating the first index comprises:
generating the first index according to the following information:
wp=f:w,r(wp,w),n(w)
wherein, wpRepresenting the second partial keyword, w representing the third full keyword, f representing the second partial keyword wpOf the matching type, r (w)pW) represents the second partial keyword wpA relevance parameter with the third full keyword w, n (w) represents the number of documents matched by the third full keyword w in the full keyword set, the left part represents the key value part of the first index, and the right part represents the attribute value part of the first indexAnd (4) dividing.
5. The method according to any of claims 1-4, characterized in that the second full keyword is a full keyword of the full keyword set having a character length greater than or equal to a threshold L.
6. The method according to any one of claims 1-4, wherein said obtaining a complete set of keywords from a corpus data set comprises:
and acquiring the complete keyword set according to the corpus data set and the coverage rate, wherein the number of the complete keywords in the complete keyword set is smaller than that of the complete keywords in the corpus data set, and the coverage rate is the ratio of the number of the expected desired complete keywords to the number of the complete keywords in the corpus data set.
7. An apparatus for information retrieval, comprising:
the system comprises an input module, a query module and a query module, wherein the input module is used for receiving a query statement, the query statement comprises fuzzy keywords, and the fuzzy keywords are words containing matching characters;
the processing module is used for determining a first part of keywords according to the fuzzy keywords, wherein the first part of keywords are the parts of the fuzzy keywords except the matching characters;
the processing module is further configured to obtain a first complete keyword according to a first index and the first part of keywords, where the first complete keyword includes the first part of keywords, the first index includes a key value part and an attribute value part, where information stored in the key value part has a corresponding relationship with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword;
the processing module is further used for acquiring a first document according to a second index and the first complete keyword, wherein the first document is the document where the first complete keyword is located, and the second index comprises the corresponding relation between the first complete keyword and the first document;
the processing module is further used for acquiring a complete keyword set according to the corpus data set;
performing n-tuple segmentation on a second complete keyword in the complete keyword set, determining words after the n-tuple segmentation as second partial keywords, wherein n is a positive integer less than or equal to the character length of the second complete keyword;
determining a third complete keyword based on a matching type, wherein the third complete keyword is a word in the complete keyword set, the word comprises the second part of keywords, the third complete keyword is matched with the second part of keywords based on the matching type, and the matching type is any one of front word matching, rear word matching or front and rear word matching;
determining the correlation parameters of the second part of keywords and the third complete keywords according to the occurrence frequency of the second part of keywords in the complete keyword set and the occurrence frequency of the third complete keywords in the complete keyword set;
generating the first index according to the second part of keywords, the matching type, the third complete keywords and the correlation parameters, wherein key value parts of the first index comprise the second part of keywords, attribute value parts of the first index comprise the third complete keywords, and the matching type, the second part of keywords and the correlation parameters of the third complete keywords are matched;
the processing module is specifically configured to obtain the first complete keyword according to the first index, the first part of keywords, and the matching type of the first part of keywords, where the matching type of the first part of keywords is determined according to a position relationship between the first part of keywords and the matching symbol;
the processing module is further configured to obtain a correlation parameter between the first partial keyword and the first complete keyword according to the first index;
the processing module is further configured to determine a score of the first document according to the relevance parameter of the first partial keyword and the first complete keyword, and the relevance parameter of the first document and the query statement.
8. The apparatus according to claim 7, wherein the processing module is specifically configured to calculate the second partial keyword w according to the following formulapA correlation parameter r (w) with the third complete keyword wp,w):
r(wp,w)=α*f(wp,w)*s(wp,w)+β*q(wp,w)
Wherein, f (w)pW) represents wpConditional probability with w, s (w)pW) represents wpTightness parameter with w, q (w)pW) represents wpAnd w, α and β are constants.
9. The apparatus of claim 7, wherein the processing module is further configured to obtain a first number of documents, where the first number of documents is a number of documents that the third complete keyword matches in the complete keyword set;
the processing module is specifically configured to generate the first index according to the second partial keyword, the matching type, the third complete keyword, the relevance parameter, and the first document number, where an attribute value part of the first index further includes the first document number;
the processing module is further used for acquiring the number of second documents corresponding to the first complete keyword according to the first index;
the processing module is specifically configured to determine a score of the first document according to the number of the second documents, the correlation parameter between the first partial keyword and the first complete keyword, and the correlation parameter between the first document and the query statement.
10. The apparatus of claim 9, wherein the processing module is specifically configured to generate the first index according to the following information:
wp=f:w,r(wp,w),n(w)
wherein, wpRepresenting the second partial keyword, w representing the third full keyword, f representing the second partial keyword wpOf the matching type, r (w)pW) represents the second partial keyword wpA relevance parameter with the third full keyword w, n (w) represents the number of documents matched by the third full keyword w in the full keyword set, the left part represents the key value part of the first index, and the right part represents the attribute value part of the first index.
11. The apparatus according to any of claims 7-10, wherein the second full keyword is a full keyword in the full keyword set having a character length greater than a threshold L.
12. The apparatus according to any one of claims 7-10, wherein the processing module is specifically configured to obtain the complete keyword set according to the corpus data set and a coverage rate, the number of complete keywords included in the complete keyword set is smaller than the number of complete keywords included in the corpus data set, and the coverage rate is a ratio of the number of expected desired complete keywords to the number of complete keywords in the corpus data set.
13. A computing device comprising a processor and a memory, the memory for storing instructions, the processor reading instructions stored in the memory for,
receiving a query statement, wherein the query statement comprises fuzzy keywords which are words containing matching characters;
determining a first part of keywords according to the fuzzy keywords, wherein the first part of keywords are the parts of the fuzzy keywords except the matching characters;
acquiring a first complete keyword according to a first index and the first part of keywords, wherein the first complete keyword comprises the first part of keywords, the first index comprises a key value part and an attribute value part, information stored in the key value part has a corresponding relation with information stored in the attribute value part, the key value part stores the first part of keywords, and the attribute value part stores the first complete keyword;
acquiring a first document according to a second index and the first complete keyword, wherein the first document is the document where the first complete keyword is located, and the second index comprises the corresponding relation between the first complete keyword and the first document;
the processor is further used for acquiring a complete keyword set according to the corpus data set;
performing n-tuple segmentation on a second complete keyword in the complete keyword set, determining words after the n-tuple segmentation as second partial keywords, wherein n is a positive integer less than or equal to the character length of the second complete keyword;
determining a third complete keyword based on a matching type, wherein the third complete keyword is a word in the complete keyword set, the word comprises the second part of keywords, the third complete keyword is matched with the second part of keywords based on the matching type, and the matching type is any one of front word matching, rear word matching or front and rear word matching;
determining the correlation parameters of the second part of keywords and the third complete keywords according to the occurrence frequency of the second part of keywords in the complete keyword set and the occurrence frequency of the third complete keywords in the complete keyword set;
generating the first index according to the second part of keywords, the matching type, the third complete keywords and the correlation parameters, wherein key value parts of the first index comprise the second part of keywords, attribute value parts of the first index comprise the third complete keywords, and the matching type, the second part of keywords and the correlation parameters of the third complete keywords are matched;
the processor is specifically configured to obtain the first complete keyword according to the first index, the first part of keywords, and the matching type of the first part of keywords, where the matching type of the first part of keywords is determined according to a position relationship between the first part of keywords and the matching symbol;
the processor is further configured to obtain a correlation parameter between the first partial keyword and the first complete keyword according to the first index;
the processor is further configured to determine a score for the first document according to the relevance parameter of the first partial keyword and the first complete keyword, and the relevance parameter of the first document and the query statement.
14. The computing device of claim 13, wherein the processor is specifically configured to calculate the second portion of keywords w according to the following formulapA correlation parameter r (w) with the third complete keyword wp,w):
r(wp,w)=α*f(wp,w)*s(wp,w)+β*q(wp,w)
Wherein, f (w)pW) represents wpConditional probability with w, s (w)pW) represents wpTightness parameter with w, q (w)pW) represents wpAnd w, α and β are constants.
15. The computing device of claim 13, wherein the processor is further configured to obtain a first number of documents, wherein the first number of documents is a number of documents that the third complete keyword matches in the complete keyword set;
the processor is specifically configured to generate the first index according to the second partial keyword, the matching type, the third complete keyword, the relevance parameter, and the first number of documents, where an attribute value part of the first index further includes the first number of documents;
the processor is further configured to obtain a second document number corresponding to the first complete keyword according to the first index;
the processor is specifically configured to determine a score of the first document according to the number of the second documents, the relevance parameter of the first partial keyword and the first complete keyword, and the relevance parameter of the first document and the query statement.
16. The computing device of claim 15, wherein the processor is further configured to generate the first index based on:
wp=f:w,r(wp,w),n(w)
wherein, wpRepresenting the second partial keyword, w representing the third full keyword, f representing the second partial keyword wpOf the matching type, r (w)pW) represents the second partial keyword wpA relevance parameter with the third full keyword w, n (w) represents the number of documents matched by the third full keyword w in the full keyword set, the left part represents the key value part of the first index, and the right part represents the attribute value part of the first index.
17. The computing device of any of claims 13-16, wherein the second full keyword is a full keyword in the set of full keywords having a character length greater than a threshold L.
18. The computing device according to any of claims 13 to 16, wherein the processor is specifically configured to obtain the complete keyword set according to the corpus data set and a coverage rate, wherein a number of complete keywords included in the complete keyword set is smaller than a number of complete keywords included in the corpus data set, and the coverage rate is a ratio of a number of expected desired complete keywords to the number of complete keywords in the corpus data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610829891.0A CN106446122B (en) | 2016-09-19 | 2016-09-19 | Information retrieval method and device and computing equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610829891.0A CN106446122B (en) | 2016-09-19 | 2016-09-19 | Information retrieval method and device and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106446122A CN106446122A (en) | 2017-02-22 |
CN106446122B true CN106446122B (en) | 2020-03-10 |
Family
ID=58169105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610829891.0A Active CN106446122B (en) | 2016-09-19 | 2016-09-19 | Information retrieval method and device and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106446122B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110019985B (en) * | 2017-12-29 | 2021-09-24 | 阿里巴巴(中国)有限公司 | Index file establishing and inquiring methods and devices |
CN110275990B (en) * | 2018-03-14 | 2021-04-23 | 北京忆芯科技有限公司 | Method and device for generating KV stored key and value |
CN109522389B (en) * | 2018-11-07 | 2020-09-01 | 中国联合网络通信集团有限公司 | Document pushing method and device and storage medium |
WO2020151015A1 (en) * | 2019-01-25 | 2020-07-30 | Microsoft Technology Licensing, Llc | Scoring documents in document retrieval |
CN113495984A (en) * | 2020-03-20 | 2021-10-12 | 华为技术有限公司 | Statement retrieval method and related device |
CN113536156B (en) * | 2020-04-13 | 2024-05-28 | 百度在线网络技术(北京)有限公司 | Search result ordering method, model building method, device, equipment and medium |
CN112732796B (en) * | 2021-01-23 | 2023-01-24 | 河北省科学院应用数学研究所 | Fuzzy query matching method |
CN113779305B (en) * | 2021-07-30 | 2024-01-02 | 北京达佳互联信息技术有限公司 | Information retrieval method and device and electronic equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1936896A (en) * | 2006-09-20 | 2007-03-28 | 网之易信息技术(北京)有限公司 | Information searching method and system based on searching engine |
CN101819578A (en) * | 2010-01-25 | 2010-09-01 | 青岛普加智能信息有限公司 | Retrieval method, method and device for establishing index and retrieval system |
US8059126B2 (en) * | 2007-03-30 | 2011-11-15 | Computer Associates Think, Inc. | System and method for indicating special characters to be interpreted literally |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090287680A1 (en) * | 2008-05-14 | 2009-11-19 | Microsoft Corporation | Multi-modal query refinement |
-
2016
- 2016-09-19 CN CN201610829891.0A patent/CN106446122B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1936896A (en) * | 2006-09-20 | 2007-03-28 | 网之易信息技术(北京)有限公司 | Information searching method and system based on searching engine |
US8059126B2 (en) * | 2007-03-30 | 2011-11-15 | Computer Associates Think, Inc. | System and method for indicating special characters to be interpreted literally |
CN101819578A (en) * | 2010-01-25 | 2010-09-01 | 青岛普加智能信息有限公司 | Retrieval method, method and device for establishing index and retrieval system |
Also Published As
Publication number | Publication date |
---|---|
CN106446122A (en) | 2017-02-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106446122B (en) | Information retrieval method and device and computing equipment | |
JP5597255B2 (en) | Ranking search results based on word weights | |
US10713261B2 (en) | Generating insightful connections between graph entities | |
US8965872B2 (en) | Identifying query formulation suggestions for low-match queries | |
US10783200B2 (en) | Systems and methods of de-duplicating similar news feed items | |
US9535911B2 (en) | Processing a content item with regard to an event | |
CN110019658B (en) | Method and related device for generating search term | |
CN107122469B (en) | Query recommendation ranking method and device based on semantic similarity and timeliness frequency | |
CN108804642A (en) | Search method, device, computer equipment and storage medium | |
CN112380244B (en) | Word segmentation searching method and device, electronic equipment and readable storage medium | |
CN103678576A (en) | Full-text retrieval system based on dynamic semantic analysis | |
JP2013545189A (en) | Determining category information using multistage | |
CN105468790B (en) | A kind of comment information search method and device | |
US10565188B2 (en) | System and method for performing a pattern matching search | |
CN111611471A (en) | Searching method and device and electronic equipment | |
JP5367632B2 (en) | Knowledge amount estimation apparatus and program | |
CN108287850B (en) | Text classification model optimization method and device | |
CN115630144A (en) | Document searching method and device and related equipment | |
CN111930949B (en) | Search string processing method and device, computer readable medium and electronic equipment | |
CN116842160A (en) | Patent search type generation method, system, equipment and medium | |
CN112818221B (en) | Entity heat determining method and device, electronic equipment and storage medium | |
CN115563242A (en) | Automobile information screening method and device, electronic equipment and storage medium | |
CN111782958A (en) | Recommendation word determining method and device, electronic device and storage medium | |
CN116610782B (en) | Text retrieval method, device, electronic equipment and medium | |
CN113220838B (en) | Method, device, electronic equipment and storage medium for determining key information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20220210 Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province Patentee after: Huawei Cloud Computing Technologies Co.,Ltd. Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd. |
|
TR01 | Transfer of patent right |