
CN102214189A - Data mining-based word usage knowledge acquisition system and method - Google Patents

Info

Publication number
CN102214189A
Authority
CN
China
Prior art keywords
words
word
candidate
input
webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010101479937A
Other languages
Chinese (zh)
Other versions
CN102214189B (en)
Inventor
方高林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN 201010147993 priority Critical patent/CN102214189B/en
Publication of CN102214189A publication Critical patent/CN102214189A/en
Application granted granted Critical
Publication of CN102214189B publication Critical patent/CN102214189B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data mining-based word usage knowledge acquisition system and method. The system comprises an input device, a search analysis device, a multi-input mode processing device, a webpage analysis device, a usage knowledge extraction device and an output device, wherein the input device is used for inputting a word or a phrase to be searched; the search analysis device analyzes a keyword in the word or phrase to be searched, and processes the word and the phrase to be searched in the corresponding input mode processing device according to the analysis result; the multi-input mode processing device analyzes and expands the word or the phrase to be searched by utilizing semantic knowledge and dictionaries to form a search item, and searches the webpage information according to the search item so as to acquire a webpage related to the word or the phrase to be searched; the webpage analysis device analyzes the searched webpage, and converts the webpage into a candidate text; the usage knowledge extraction device processes the candidate text, and extracts context information and typical sentences of the word or the phrase to be searched; and the output device outputs the context information and the typical sentences. By adopting the device and the method, the word usage knowledge can be acquired accurately.

Description

System and method for acquiring word usage knowledge based on data mining
[ technical field ]
The invention relates to the technical field of computer information processing, in particular to a system and a method for acquiring word usage knowledge based on data mining.
[ background of the invention ]
When reading, writing, or translating a foreign language, people often encounter words and phrases that are not included in a dictionary, and the same word or phrase is translated differently in different contexts. How to write idiomatic words and phrases is therefore a problem for everyone who uses a foreign language. For Chinese students, the problem of writing idiomatic sentences is even more prominent because of the differences between Chinese and English culture and language style and a lack of knowledge about English collocations (such as adjective-noun, verb-noun, and verb-preposition collocations).
The development of the internet provides unprecedentedly rich resources, including electronic documents, online periodicals, magazines, newspapers, scientific and technical literature and the like, and with the rapid development of networks and information technologies these resources keep growing. Usage knowledge for a word or phrase can usually be found by web search. However, the results of a general search engine alone are rarely usable as the knowledge the user needs, because the search results only list the web pages related to the word and do not consider the linguistic role the word or phrase plays. In addition, the large amount of redundant information in the search results makes it difficult for the user to find instances of correct word usage. Mining useful knowledge from a large number of resources has therefore become an important issue for network applications. A Web-based word usage system obtains collocation information and example sentences of words from the Internet so as to help users write idiomatic foreign language text correctly.
[ summary of the invention ]
Based on this, there is a need for a system for obtaining word usage knowledge based on data mining that can more accurately obtain word usage knowledge.
A system for obtaining word usage knowledge based on data mining, the system comprising: the input device is used for inputting words or phrases to be searched; the query analysis device is used for analyzing the keywords in the words or phrases to be searched and sending the words or phrases to be searched into the corresponding input mode processing device for processing according to the analysis result; the multi-input mode processing device analyzes and expands the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searches webpage information according to the query terms to obtain webpages related to the words or phrases to be searched; the webpage analysis device is used for analyzing the searched webpage and converting the webpage into a candidate text; the usage knowledge extraction device is used for processing the candidate text and extracting the context information and the typical example sentence of the word or phrase to be searched; and the output device outputs the context information and the typical example sentences.
Wherein the multiple input mode processing apparatus includes the following multiple input mode units: the system comprises a comparison mode unit, a category mode unit, a target language collocation mode unit and a single word mode unit, and also comprises a search engine retrieval module for retrieving a webpage;
the comparison mode unit adopts logic words to combine words or phrases into query terms, the category mode unit analyzes and expands input central words and category information to form query terms, the target word collocation mode unit translates and expands input collocation words to form query terms, the single word mode unit forms query terms according to the input single words, and the search engine retrieval module retrieves webpage information according to the query terms to acquire webpages related to the input words or phrases.
The webpage analysis device can further analyze the searched webpage information, remove repeated webpages, analyze each webpage into a document model tree form, remove non-text labels in the webpages in the document model tree, and reserve useful labels, so that the webpages are converted into candidate texts in a text form.
The usage knowledge extraction device comprises a context information extraction unit, which processes the candidate text into single sentences through boundary identification, obtains candidate words in each single sentence through keyword search, counts each candidate text with a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputs a candidate list of context information according to the occurrence frequency of the candidate words.
Further, the context information extraction unit ranks the candidate words according to their occurrence frequency, selects a preset number of candidate words according to the ranking, and removes function words and non-content words according to a stop word list to obtain a candidate list containing the context information of the selected candidate words.
Wherein, the usage knowledge extraction device further comprises a typical example sentence extraction unit, and the typical example sentence extraction unit comprises: the candidate example sentence extraction module is used for extracting sentences containing the context information in the webpage candidate texts as candidate example sentences; the clustering module is used for clustering the candidate example sentences by utilizing a sentence clustering method based on characteristics; and the typical example sentence extraction module selects a sentence which is taken as a clustering center from the clustered sentences as a typical example sentence.
In addition, it is necessary to provide a method for acquiring word usage knowledge based on data mining, which can acquire word usage knowledge more accurately.
A method for acquiring word usage knowledge based on data mining comprises the following steps: A. receiving a word or phrase to be searched input by a user; B. analyzing the keywords in the words or phrases to be searched, and sending the words or phrases to be searched into a corresponding input mode for processing according to the analysis result; C. analyzing and expanding the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searching webpage information according to the query terms to obtain webpages related to the input words or phrases; D. analyzing the searched web page, and converting the web page into a candidate text; E. processing the candidate text, and extracting context information and typical example sentences of the words or phrases; F. and outputting the context information and the typical example sentence.
Wherein the input modes include one or more of the following modes: a comparison mode, a category mode, a target language collocation mode and a single word mode.
When the input mode is the comparison mode, the step C may specifically be: and combining the words or phrases into a query term by adopting the logic words, retrieving webpage information according to the query term, and acquiring the webpage related to the input words or phrases.
When the input mode is a category mode, the step C may specifically be: analyzing and expanding the input central word and category information according to semantic knowledge to form a query term, retrieving webpage information according to the query term, and acquiring a webpage related to the input word or phrase.
When the input mode is the target language collocation mode, the step C may specifically be: analyzing and expanding the input collocation words according to the dictionary to form query terms, retrieving webpage information according to the query terms, and acquiring webpages related to the input words or phrases.
When the input mode is a single word mode, the step C may specifically be: and forming a query term according to the input single word, retrieving webpage information according to the query term, and acquiring a webpage related to the input word or phrase.
And the step D may specifically be: analyzing the searched webpage information, removing repeated webpages, and analyzing each webpage into a form of a document model tree; and in the document model tree, removing non-text labels in the webpage, reserving useful labels, and converting the webpage into candidate texts in a text form.
Wherein, step E includes: processing the candidate text into a single sentence through boundary identification, obtaining candidate words in the single sentence through keyword search, counting each candidate text by utilizing a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputting a candidate list of context information according to the occurrence frequency of the candidate words.
Step E may further comprise: sorting the candidate words according to their occurrence frequency, selecting a preset number of candidate words according to the sorting, and removing function words and non-content words according to a stop word list to obtain a candidate list containing the context information of the selected candidate words.
Wherein, step E may further comprise: extracting sentences containing the context information from the single sentence as candidate example sentences; clustering the candidate example sentences by using a sentence clustering method based on characteristics; and selecting a sentence as a clustering center from the clustered sentences as a typical example sentence.
According to the system and the method for acquiring word usage knowledge based on data mining, the keywords of the word or phrase to be searched are analyzed and the word or phrase is sent to the corresponding input mode processing device. Compared with searching by a single word only, this obtains information matching the word or phrase to be searched more accurately. The retrieved web pages are converted into candidate texts, and the context information and the typical example sentences of the word or phrase to be searched are extracted after the candidate texts are processed. The extracted context information and typical example sentences effectively reflect how the word is used, make it convenient to obtain usage knowledge for the word, and improve the user experience.
In addition, multiple input modes such as a comparison mode, a category mode, a target language collocation mode and the like can effectively limit retrieval conditions, so that more accurate word collocation knowledge can be mined under the condition of counting the same number of webpages; candidate example sentences are clustered through a sentence clustering method based on characteristics, and retrieved redundant example sentences are analyzed and clustered, so that the extracted typical example sentences are representative and can better meet the requirements of users.
[ description of the drawings ]
FIG. 1 is a block diagram that illustrates a system for obtaining word usage knowledge based on data mining, according to one embodiment;
FIG. 2 is a schematic diagram of a multi-input mode processing apparatus according to an embodiment;
FIG. 3 is a schematic diagram of the structure of a usage knowledge extraction apparatus in one embodiment;
FIG. 4 is a diagram illustrating an exemplary sentence extraction unit in accordance with one embodiment;
FIG. 5 is a flow diagram of a method for obtaining word usage knowledge based on data mining, under an embodiment;
FIG. 6 is a flow diagram of a method for processing multiple input modes in one embodiment;
FIG. 7 is a flow diagram of a method for extracting a representative example sentence in one embodiment;
FIG. 8 is a flow diagram of a clustering method based on key features in one embodiment.
[ detailed description ]
Fig. 1 shows a system for acquiring word usage knowledge based on data mining in one embodiment, which includes an input device 10, a query analysis device 20, a multiple input pattern processing device 30, a web page analysis device 40, a usage knowledge extraction device 50, and an output device 60. Wherein:
The input device 10 is used for inputting a word or phrase to be searched. In one embodiment, the input device 10 accepts the word or phrase in multiple modes. For example, to look up usage knowledge for the word "solve", the search can use the single word mode (e.g., "solve"), the target language collocation mode (e.g., "solve" followed by a Chinese collocate meaning "problem"), the category mode (e.g., "<solve>" followed by synonyms such as "difficulty", or "<solve> n."), the comparison mode (e.g., "solve problem/issue"), and the like.
The query analysis device 20 is configured to analyze the keywords in the word or phrase to be searched, and send the word or phrase to be searched into the corresponding input mode processing device for processing according to the analysis result. For the multiple input modes, the words or phrases input through different input modes are processed by corresponding different input mode processing devices, the query analysis device 20 analyzes the keywords in the input words or phrases, and when only a single word in the words or phrases is analyzed, the words or phrases are sent to a single word mode unit for processing; when the word or phrase contains the character "< >", the word or phrase is sent to a category mode unit for processing; when the words or phrases contain Chinese, the words or phrases are sent to a target language collocation mode unit for processing; when the word or phrase contains the character "/", it is sent to the comparison mode unit for processing.
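The routing rules described above can be sketched as a small dispatcher. This is a minimal illustration, not the patented implementation: the function name `detect_input_mode`, the returned labels, the use of a CJK character range as the test for "contains Chinese", and the order in which the checks are applied are all assumptions, since the text does not state a precedence between the rules.

```python
import re

# Assumption: any CJK unified ideograph counts as "contains Chinese".
_CJK = re.compile(r"[\u4e00-\u9fff]")


def detect_input_mode(query: str) -> str:
    """Return a label for the input-mode unit that should handle `query`."""
    text = query.strip()
    if "<" in text and ">" in text:
        return "category"                      # e.g. "<solve> n."
    if _CJK.search(text):
        return "target_language_collocation"   # English word plus Chinese collocate
    if "/" in text:
        return "comparison"                    # e.g. "lay/make foundation"
    if len(text.split()) == 1:
        return "single_word"                   # e.g. "solve"
    return "unknown"                           # not covered by the described rules
```

For example, `detect_input_mode("lay/make foundation")` yields the comparison label, while a bare `"solve"` is routed to the single word unit.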
The multi-input mode processing device 30 analyzes and expands the word or phrase to be searched by using semantic knowledge and a dictionary to form a query term, and searches the web page information according to the query term to obtain the web page related to the word or phrase to be searched. In one embodiment, as shown in fig. 2, the multiple input mode processing device 30 includes the following multiple input mode units: a comparison mode unit 301, a category mode unit 302, a target word collocation mode unit 303, and a single word mode unit 304, and further includes a search engine retrieval module 305 for retrieving web pages. The following describes the processing procedure in these input modes:
In the comparison mode, for example, when the user inputs "lay/make foundation", the comparison mode unit 301 needs to determine which phrase is more common (i.e., the more idiomatic usage): "lay foundation" or "make foundation". The comparison mode unit 301 preferably combines the words or phrases (i.e., the candidate words in the input) into a query term by using logical words, forming a new query, and the search engine retrieval module 305 then retrieves the related web pages. For example, for "lay/make foundation", the new query term composed with logical words (OR, AND, etc.) is "(lay OR make) AND foundation". The query term is sent to the search engine retrieval module 305, which searches for and downloads the web pages that match it. In addition, the occurrence frequencies of the candidate words "lay", "make", and "foundation" can be counted, and the web pages can be sorted by these frequencies. Because many web pages may be retrieved, a limit on the number of downloaded web pages may be preset; for example, only the top 300 ranked web pages are downloaded. The comparison mode obtains statistics for several collocations with a single query, which makes it particularly suitable when semantic expansion produces many combinations. It can also discover new collocation information: for example, when searching for "solve issue/query", the word "recipe" is counted as well because it often occurs together with "issue". Because the retrieved web pages are ranked by the frequency of the candidate words, the preset number of selected web pages is more representative.
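The construction of a comparison-mode query can be sketched as follows. The helper name `build_comparison_query` and the exact parenthesized output format are assumptions made for illustration; the text only states that the alternatives are joined with logical words such as OR and AND.

```python
def build_comparison_query(user_input: str) -> str:
    """Turn input such as "lay/make foundation" into a boolean query string."""
    parts = []
    for token in user_input.split():
        # A token containing "/" lists alternatives to be compared.
        alternatives = [a for a in token.split("/") if a]
        if not alternatives:
            continue  # ignore a stray "/" with nothing around it
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        else:
            parts.append(alternatives[0])
    return " AND ".join(parts)


print(build_comparison_query("lay/make foundation"))  # (lay OR make) AND foundation
```

A single query of this form retrieves pages for every alternative at once, which is what lets the later counting step compare the collocations side by side.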
In the category mode, the category mode unit 302 analyzes and expands the input central word and category information to form a query term. The category mode has two forms: one enters a central word and a part of speech, e.g., "<solve> n."; the other enters a central word and synonyms, e.g., "<solve>" followed by synonyms such as "difficulty". The part of speech and the synonyms indicate the category of the candidate words that collocate with the central word. Because the category information constrains the candidate words, the candidate words that collocate with the central word can be obtained more accurately. Collocations are generally divided into two kinds: grammatical collocations and lexical collocations. A grammatical collocation joins a content word (noun, adjective, or verb) with a preposition or another grammatical structure, and includes adjective-preposition, noun-infinitive, noun-clause, verb-infinitive combinations and the like. Lexical collocations typically include verb-noun, adjective-noun, verb-adverb, noun-preposition, and verb-preposition combinations. The words in a collocation can generally be divided into 5 parts of speech: adjectives, verbs, nouns, adverbs, and prepositions. These 5 parts of speech may be used as category restrictions.
To describe the category information more precisely, synonyms can be used as context vectors that restrict the query and narrow the search results. Because the synonyms must be provided by the user, and the user can provide only a small amount of information, the synonyms can be expanded automatically by using the hypernym information in the WordNet semantic dictionary. WordNet is a lexical database that organizes words into a network of synonym sets, where each connection represents a relationship between them, for example hypernymy, hyponymy, synonymy, or membership. Based on the principle that words with similar meanings or belonging to the same class tend to occur together, the hypernyms in WordNet are used to expand the query so as to cover the possible meanings. For example, for "< solution > thingqueous", "thingqueous" is the context vector, and for expansion the hypernym "difficulty" of "queous" is also added to the context vector, forming a new query term. Thus, the keyword "solve" and the context vector, which is composed of a group of related words and reflects detailed category information, are sent to the search engine retrieval module 305 for retrieval of related web pages.
In the target language collocation mode, the target language collocation mode unit 303 translates and expands the input collocate to form a new query term, and the search engine retrieval module 305 retrieves web page information according to the query term to obtain web pages related to the input word or phrase. In one embodiment, the user inputs "solve" together with the Chinese word for "question" to look up usage knowledge for "solve", and the target language collocation mode unit 303 uses the Chinese collocation information as a restriction when obtaining the relevant web pages. In this mode, the Chinese part is first translated according to a Chinese-English knowledge base. Because a general Chinese-English dictionary offers only a few translation options and cannot cover the semantic range of the Chinese word, the translations are expanded with synonyms to form as complete a feature word vector as possible. For example, after translation and synonym expansion, the new query term is "solve AND (issue OR matrix OR query)". The web pages retrieved by the search engine retrieval module 305 based on this query term are thus limited to the category of "question". In addition, the query term can be further expanded with the WordNet semantic dictionary by adding hypernyms from WordNet. After this further expansion, the new query is "solve AND (issue OR machine OR protocol OR sensitivity)", where "sensitivity" is the hypernym of "issue".
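The translate-then-expand step can be sketched with two lookup tables. Everything concrete here is an assumption for illustration: the function name `build_collocation_query`, the toy translation table, and the toy hypernym table stand in for the Chinese-English knowledge base and for WordNet, and the English words in them are hypothetical sample data rather than the patent's own examples.

```python
def build_collocation_query(keyword, chinese_word, translations, hypernyms=None):
    """Build "keyword AND (t1 OR t2 ...)" from a Chinese collocate.

    translations: dict mapping a Chinese word to English candidates.
    hypernyms:    optional dict mapping an English word to its hypernyms.
    """
    terms = list(translations.get(chinese_word, []))
    if hypernyms:
        # Expand with hypernyms, keeping first-seen order and avoiding duplicates.
        for term in list(terms):
            for upper in hypernyms.get(term, []):
                if upper not in terms:
                    terms.append(upper)
    if not terms:
        return keyword  # nothing to restrict with; fall back to the keyword alone
    return f"{keyword} AND ({' OR '.join(terms)})"


# Hypothetical toy data standing in for the real dictionaries.
TOY_TRANSLATIONS = {"问题": ["issue", "problem"]}
TOY_HYPERNYMS = {"problem": ["difficulty"]}

print(build_collocation_query("solve", "问题", TOY_TRANSLATIONS))
print(build_collocation_query("solve", "问题", TOY_TRANSLATIONS, TOY_HYPERNYMS))
```

Keeping the dictionaries as plain parameters makes it easy to swap the toy tables for a real bilingual lexicon or a WordNet interface without changing the query-building logic.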
In a single word mode, such as entering the single word "solve," the single word mode unit 304 forms a query term from the single word, and the search engine retrieval module 305 retrieves a web page containing the single word.
The web page analyzing device 40 is configured to analyze the searched web pages and convert the web pages into candidate texts. In one embodiment, the web page analyzing device 40 further analyzes the searched web pages to remove duplicate web pages, and analyzes each web page into a document model tree in which non-text labels in the web page are removed and useful labels (such as boundary symbols) are retained, thereby converting the web page into text candidates. The candidate text is used in a subsequent usage knowledge extraction process.
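The conversion of retrieved pages into candidate text can be sketched with Python's standard `html.parser`. This is a simplification under stated assumptions: it streams over the tags instead of building an explicit document model tree, it treats `script`, `style`, and `noscript` as the non-text labels to drop, and it detects duplicate pages by comparing the extracted text. The names `pages_to_candidate_texts` and `_TextExtractor` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text while skipping the content of non-text tags."""

    SKIP = {"script", "style", "noscript"}  # assumed set of non-text labels

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def pages_to_candidate_texts(pages):
    """Convert HTML strings to plain texts, dropping pages with identical text."""
    seen, texts = set(), []
    for html in pages:
        parser = _TextExtractor()
        parser.feed(html)
        parser.close()
        text = " ".join(parser.chunks)
        if text and text not in seen:
            seen.add(text)
            texts.append(text)
    return texts
```

Sentence-ending punctuation survives this step untouched, which matters because the later boundary identification depends on it.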
The usage knowledge extracting device 50 is used for processing the candidate text and extracting the context information and the typical example sentence of the word or phrase to be searched. In one embodiment, as shown in fig. 3, the usage knowledge extraction apparatus 50 includes a context information extraction unit 501 and a typical example sentence extraction unit 502, where:
The context information extraction unit 501 splits the candidate text into single sentences through boundary identification, obtains candidate words in each sentence through keyword search, counts over all candidate texts with a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputs a candidate list of context information according to these frequencies. In one embodiment, the candidate words searched for in a single sentence are the input word and the words collocated with the input word or phrase; after their occurrence frequencies are counted, the candidate words can be sorted by frequency. Only co-occurrence within one grammatical sentence is counted: a candidate word contributes to the statistics only if it appears in the same sentence as the input word, which makes the result more representative. After all single sentences are counted, the candidate words are ranked by frequency and a preset number of them is selected according to the ranking, for example the first 5, so that low-frequency candidates are removed. Function words (such as "a", "an", "the") and other non-content words are then removed according to the stop word list, yielding a candidate list containing the context information of the selected candidate words.
The candidate list can be divided according to whether a candidate word appears before or after the searched word or phrase. Finally, the preceding context (all candidate words found in front of the searched word or phrase) and the following context (all candidate words found behind it) are output.
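The sentence-level co-occurrence counting and the split into preceding and following context can be sketched as below. The function name `extract_context`, the tiny stop word list, and the simple whitespace tokenization are assumptions. The sketch also filters stop words before ranking rather than after selecting the top candidates, and it only considers the first occurrence of the keyword in each sentence; both are simplifications of the description.

```python
from collections import Counter

# Illustrative stop word list only; a real list would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "to", "and", "in", "is", "we"}


def extract_context(sentences, keyword, top_n=5, stop_words=STOP_WORDS):
    """Return (preceding, following) candidate words ranked by frequency."""
    before, after = Counter(), Counter()
    for sentence in sentences:
        words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
        words = [w for w in words if w]
        if keyword not in words:
            continue  # only words sharing a sentence with the keyword are counted
        pos = words.index(keyword)
        for i, w in enumerate(words):
            if i == pos or w == keyword or w in stop_words:
                continue
            (before if i < pos else after)[w] += 1
    top_before = [w for w, _ in before.most_common(top_n)]
    top_after = [w for w, _ in after.most_common(top_n)]
    return top_before, top_after
```

With sentences such as "We solve the problem quickly." and "They solve the problem.", the word "problem" ranks first in the following context for the keyword "solve".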
The typical-example-sentence extracting unit 502 is used to extract a typical example sentence. As shown in fig. 4, in one embodiment, the exemplary sentence extraction unit 502 includes a candidate exemplary sentence extraction module 5021, a clustering module 5022, and an exemplary sentence extraction module 5023. Wherein: the candidate example sentence extraction module 5021 is used for extracting sentences containing the context information from the webpage candidate texts as candidate example sentences; the clustering module 5022 is used for clustering the candidate example sentences by using a sentence clustering method based on characteristics; the exemplary sentence extraction module 5023 is used for selecting a sentence which is a clustering center from the clustered sentences as the exemplary sentence.
In one embodiment, the candidate example sentence extraction module 5021 splits the web page candidate text into single sentences. Specifically, a document may be divided into individual sentences at sentence-ending punctuation marks such as the period. To distinguish a sentence-ending period from the dot that follows an abbreviation, a list of abbreviations may be constructed and rules may be specified to determine whether a dot ends the sentence. In addition, the length of a separated single sentence can be limited; for example, only a sentence containing more than 5 words and fewer than 30 words is used as a candidate example sentence.
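The boundary rule with an abbreviation list and the length limit can be sketched as follows. The abbreviation list shown is a small illustrative sample, the function name `split_sentences` is hypothetical, and treating "!" and "?" as sentence endings alongside the period is an assumption.

```python
# Illustrative abbreviation list; a real system would use a larger one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc.", "fig."}


def split_sentences(text, min_words=6, max_words=29, abbreviations=ABBREVIATIONS):
    """Split text into sentences and keep those with 6 to 29 words.

    The bounds implement "more than 5 words and fewer than 30 words".
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith((".", "!", "?"))
        # A dot that belongs to a known abbreviation does not end the sentence.
        if ends_sentence and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return [s for s in sentences if min_words <= len(s.split()) <= max_words]
```

The length filter discards fragments such as headings and navigation text, which tend to be very short, as well as run-on passages that are too long to serve as example sentences.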
In one embodiment, the clustering module 5022 clusters the candidate example sentences by using a sentence clustering method based on features as follows:
(1) Initialization
All candidate example sentences obtained above are taken as data segment samples, and the matching distance $d(O_i, O_j)$ between every two data segment samples is calculated by a feature-distance-based method. This forms a distance matrix, so that later steps can obtain a distance directly by table lookup.
Using the stop word list, each sentence S is reduced to a sentence containing only its main components: words in the stop word list are removed, inflected word forms are restored to their base forms, and a synonym dictionary is used to merge semantically similar words within the sentence, so that each sentence is represented by semantically unrelated features. This is similar to principal component analysis in pattern recognition. Let the two sentences after analysis be $O_1 = w_1 w_2 \ldots w_m$ and $O_2 = w_1 w_2 \ldots w_n$. The distance between them is defined as:
Figure GSA00000080931000091
wherein
Figure GSA00000080931000092
represents the semantic similarity between two words, defined as 1 if the two words are the same or semantically similar and 0 otherwise; $m$ and $n$ represent the numbers of main words in the two sentences, respectively.
The expected number of clusters $C$, the inter-class distance threshold $\theta_C$ for class merging, the minimum number of samples in each class $\theta_N$, and the maximum number of iterations $t_{max}$ are set in advance; $c$ denotes the current number of classes, and $t$ denotes the number of iterations.
(2) Initializing cluster centers
Sentences containing more candidate words are selected from c web pages of different sources, one from each, as the initial cluster centers. A threshold on the number of candidate words contained in an initial cluster center sentence may be set in advance; when the number of candidate words in a sentence reaches the threshold, that sentence serves as an initial cluster center.
(3) Sample classification
The data segment samples are assigned to classes according to the minimum-distance principle, and the number of samples in each class is recorded. For any $O \in \Omega$, if $d(O, m(\Gamma_j)) = \min_{1 \le i \le c} d(O, m(\Gamma_i))$, then $O \in \Gamma_j$, where $m(\Gamma_j)$ denotes the cluster center of class $\Gamma_j$, $\Omega$ is the space containing all sentences, $j$ is the class index, and $\Gamma_j$ is the sample space of the j-th class. The number of samples in each class is checked at the same time: if it is less than $\theta_N$, the class is discarded, $c$ is set to $c - 1$, and the samples of that class are reassigned to the remaining classes.
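The sample classification step can be sketched on top of a precomputed distance matrix. The function name `assign_samples` and the data layout are assumptions: samples are identified by their row index in the matrix, cluster centers are sample indices, and an undersized class is dropped one at a time before all samples are reassigned.

```python
def assign_samples(dist, centers, min_samples):
    """Assign every sample to its nearest center, dropping undersized classes.

    dist:        square matrix, dist[i][j] is the distance between samples i and j.
    centers:     list of sample indices currently used as cluster centers.
    min_samples: minimum class size (theta_N in the text).
    Returns a dict mapping each surviving center to its list of sample indices.
    """
    centers = list(centers)
    while True:
        clusters = {c: [] for c in centers}
        for i in range(len(dist)):
            nearest = min(centers, key=lambda c: dist[i][c])  # minimum-distance rule
            clusters[nearest].append(i)
        small = [c for c in centers if len(clusters[c]) < min_samples]
        if not small or len(centers) == 1:
            return clusters
        centers.remove(small[0])  # c = c - 1, then reassign all samples


# Toy example: one-dimensional points with absolute difference as the distance.
points = [0, 1, 2, 10, 11, 30]
toy_dist = [[abs(a - b) for b in points] for a in points]
print(assign_samples(toy_dist, centers=[0, 3, 5], min_samples=2))
```

In the toy run, the class centered on the isolated point 30 has a single member, so it is discarded and that sample joins the class centered on 10.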
(4) Recalculating cluster centers
Recalculate the cluster center $m(\Gamma_j)$ of each class, j = 1, 2, …, c. The cluster center is calculated as follows:
Find the pseudo center O′, which is the element of $\Gamma_j$ with the largest number of elements whose distance to it is less than a certain threshold. Let $\bar{d}$ and $\sigma_d^2$ be the mean and variance of $d(O_k, O_l)$, respectively, where $O_k, O_l \in \Gamma_j$; then:
$$\bar{d} = \frac{2}{N_j (N_j - 1)} \sum_{k=1}^{N_j - 1} \sum_{l=k+1}^{N_j} d(O_k, O_l)$$
$$\sigma_d^2 = \frac{2}{N_j (N_j - 1)} \sum_{k=1}^{N_j - 1} \sum_{l=k+1}^{N_j} d^2(O_k, O_l) - \bar{d}^2$$
where the threshold is defined in terms of $\bar{d}$ and $\sigma_d$. If only one element satisfies the above condition, that sample is taken as the pseudo center; if two or more elements satisfy it simultaneously, all samples in $\Gamma_j$ whose matching distance is less than the threshold are taken as a subclass of the class, the average intra-class distance between each element of the subclass and the other elements is calculated, and the element with the smallest average intra-class distance is selected as the pseudo center. The pseudo center obtained in this way is the sample closest to the actual cluster center and can be used in its place.
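The mean and variance of the pairwise distances, and the selection of the pseudo center, can be sketched in Python as below. The threshold is passed in as a parameter because its defining formula is not reproduced in this text. Treating the tied candidates themselves as the subclass used for tie-breaking is one possible reading of the description, and all function names are hypothetical.

```python
from itertools import combinations


def pairwise_stats(members, dist):
    """Mean and variance of d(O_k, O_l) over all unordered pairs of a class."""
    pairs = [dist(a, b) for a, b in combinations(members, 2)]
    mean = sum(pairs) / len(pairs)            # len(pairs) == N_j * (N_j - 1) / 2
    variance = sum(d * d for d in pairs) / len(pairs) - mean ** 2
    return mean, variance


def pseudo_center(members, dist, threshold):
    """Return the member with the most other members closer than `threshold`."""
    n = len(members)
    counts = [
        sum(1 for j in range(n) if j != i and dist(members[i], members[j]) < threshold)
        for i in range(n)
    ]
    best = max(counts)
    tied = [i for i in range(n) if counts[i] == best]
    if len(tied) == 1:
        return members[tied[0]]

    def avg_dist(i):
        # Average distance from candidate i to the other tied candidates.
        return sum(dist(members[i], members[j]) for j in tied if j != i) / (len(tied) - 1)

    return members[min(tied, key=avg_dist)]
```

Both functions assume a class with at least two members; a singleton class would need separate handling.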
(5) If this is an even-numbered iteration or c ≥ 2C, go to step (8); otherwise continue.
(6) Calculating intra-class distance
Calculate, for $\Gamma_j$, the overall intra-class distance $\lambda_j$ and the average intra-class distance
Figure GSA00000080931000106
Figure GSA00000080931000107
j=1,2,…,c。
(7) Class splitting
The class with the largest intra-class distance is split into two classes. The largest intra-class distance may be chosen in two ways: by the overall intra-class distance or by the average intra-class distance. Let the selected class be $\Gamma_{j\max}$. If $|\Gamma_{j\max}| \geq 2\theta_N$ or $c \leq C/2$, $\Gamma_{j\max}$ is split as follows: find two samples $O_{p1}$ and $O_{p2}$ in the class such that $d(O_{p1}, O_{p2}) \geq d(O_{p3}, O_{p4})$ for any sample pair $O_{p3}$, $O_{p4}$ in the class; $O_{p1}$ and $O_{p2}$ then replace the original cluster center as two new cluster centers, c = c + 1, and the procedure goes to step (9).
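The splitting rule can be illustrated with a short Python sketch: when the size or class-count condition holds, the two samples that are farthest apart become the new centers. The function name and argument names are hypothetical, and choosing which class to split (by overall or average intra-class distance) is left to the caller because those formulas are given only as images in the source.

```python
from itertools import combinations


def split_class(members, dist, min_size, c, expected):
    """Return the farthest pair as two new centers, or None if no split applies."""
    if len(members) < 2:
        return None
    # Split only if the class is large enough or there are too few classes.
    if not (len(members) >= 2 * min_size or c <= expected / 2):
        return None
    return max(combinations(members, 2), key=lambda pair: dist(*pair))
```

When a pair is returned, the caller would replace the old center with the two samples and increase the class count by one.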
(8) Calculating inter-class distance
Calculate the distance between every pair of cluster centers using the principal-component-based feature distance calculation: $d(m(\Gamma_i), m(\Gamma_j))$, $1 \leq i, j \leq c$.
(9) Class merging
Among all $d(m(\Gamma_i), m(\Gamma_j))$, find the minimum value $d(m(\Gamma_p), m(\Gamma_q))$. If $d(m(\Gamma_p), m(\Gamma_q)) < \theta_C$, merge class $\Gamma_p$ and class $\Gamma_q$, calculate the new cluster center using step (4)
Figure GSA00000080931000109
and let c = c - 1.
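Steps (8) and (9) can be sketched together in Python: find the closest pair of cluster centers and merge the two classes if that distance is below the merging threshold. The dictionary representation of classes and the `recompute_center` callback (standing in for step (4)) are assumptions made for illustration.

```python
from itertools import combinations


def merge_closest(classes, dist, theta_c, recompute_center):
    """Merge the two classes with the nearest centers if closer than theta_c.

    `classes` maps each center to its member list; a new mapping is returned.
    """
    if len(classes) < 2:
        return dict(classes)
    p, q = min(combinations(classes, 2), key=lambda pair: dist(*pair))
    if dist(p, q) >= theta_c:
        return dict(classes)              # nothing close enough to merge
    merged = classes[p] + classes[q]
    result = {c: m for c, m in classes.items() if c not in (p, q)}
    result[recompute_center(merged)] = merged   # class count drops by one
    return result
```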
(10) Let t = t + 1. If $t < t_{\max}$, go to step (3); otherwise store the data related to the clustering (the number of clusters c, the cluster centers, and the sample closest to each cluster center, i.e., the pseudo center) and end.
After the cluster centers and the samples closest to them (i.e., the pseudo centers, which may also be used as cluster centers) have been calculated, the typical example sentence extraction module 5023 extracts the sentences serving as cluster centers (including the actual cluster centers and the samples close to them) and outputs them as the typical example sentences.
The output device 60 is used for outputting the obtained context information and the typical example sentence.
FIG. 5 shows a flow of a method for obtaining word usage knowledge based on data mining in an embodiment, which includes the following specific processes:
Step S10, receiving the word or phrase to be searched that is input by the user. In one embodiment, the word or phrase to be searched can be input in a plurality of input modes. For example, to search for usage knowledge of the word "solution", inputs in several modes such as "solution", "solution question", "< solution > difficulty, that", "< solution > n.", and "solution problem/issue" can be used for searching.
Step S20, analyzing the keywords in the word or phrase to be searched, and sending the word or phrase to be searched to the corresponding input mode for processing according to the analysis result. For the multiple input modes, words or phrases input in different input modes are processed by the corresponding input mode processing devices, and the keywords in the input words or phrases are analyzed. When the analysis finds only a single word in the word or phrase, it is sent to the single word mode unit 304 for processing; when the word or phrase contains the characters "< >", it is sent to the category mode unit 302 for processing; when the word or phrase contains Chinese, it is sent to the target language collocation mode unit 303 for processing; when the word or phrase contains the character "/", it is sent to the comparison mode unit 301 for processing.
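The dispatch rule of step S20 can be illustrated with a small Python sketch that inspects the surface markers of the input. The order in which the markers are checked, the mode names, and the use of the basic CJK Unified Ideographs range to detect Chinese are assumptions; the source does not state a precedence among the markers.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs block


def detect_input_mode(query):
    """Map a raw query to one of the four input modes by its surface markers."""
    if "/" in query:
        return "comparison"
    if "<" in query and ">" in query:
        return "category"
    if CJK.search(query):
        return "target_language_collocation"
    if len(query.split()) == 1:
        return "single_word"
    return "unknown"
```

For example, `detect_input_mode("lay/make foundation")` returns `"comparison"`, so the input would be handed to the comparison mode unit.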
Step S30, analyzing and expanding the word or phrase to be searched by semantic knowledge and dictionary to form a search term, searching the web page information according to the search term to obtain the web page related to the input word or phrase. In one embodiment, as shown in fig. 6, the specific process of step S30 is as follows:
In step S301, when the input mode is the comparison mode, the words or phrases are combined into a query term using logical words, so as to form a new query. For example, for "lay/make foundation", the new query term composed with logical words (OR, AND, etc.) is "(lay OR make) AND foundation". The query term is sent to the search engine retrieval module 305, which searches for web pages matching the query term and downloads them. In addition, the occurrence frequencies of the candidate words "lay", "make", and "foundation" can be counted, and the web pages can be sorted according to these frequencies. Since many web pages may be retrieved, a limit on the number of downloaded web pages may be preset; for example, only the top 300 web pages in the ranking are downloaded. The comparison mode obtains statistics for several collocations with a single query, and is particularly suitable when semantic expansion produces many combinations. It can also find new collocation information; for example, when searching for "solvaissue/query", "recipe" can also be counted because it often appears together with "issue". Because the retrieved web pages are ranked by the frequency of the candidate words, the preset number of selected web pages is more representative.
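The comparison-mode processing can be sketched in Python as two small functions: one that turns slash-separated alternatives into a Boolean query, and one that ranks downloaded pages by how often the candidate words occur. The function names are hypothetical, and the whitespace tokenisation used for counting is a simplification.

```python
from collections import Counter


def build_comparison_query(text):
    """Turn 'lay/make foundation' into '(lay OR make) AND foundation'."""
    parts = []
    for token in text.split():
        alternatives = [a for a in token.split("/") if a]
        if len(alternatives) > 1:
            parts.append("(" + " OR ".join(alternatives) + ")")
        elif alternatives:
            parts.append(alternatives[0])
    return " AND ".join(parts)


def rank_pages(pages, candidates, limit=300):
    """Sort page texts by total candidate-word frequency and keep the top `limit`."""
    def score(page):
        words = Counter(page.lower().split())
        return sum(words[c] for c in candidates)

    return sorted(pages, key=score, reverse=True)[:limit]
```

The default `limit=300` mirrors the example limit mentioned in the text.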
In step S302, when the input mode is the category mode, the input core word and the category information are analyzed and expanded to form a query term. The category mode has two types: one inputs a core word and a part of speech, e.g., "< solution > n."; the other inputs a core word and its synonyms, e.g., "< solution > difficulty, this". The part of speech and the synonyms indicate the category information of the candidate words that collocate with the core word.
To describe the category information more precisely, the synonyms can be used as a context vector to constrain the query and narrow the search results. Because the synonyms must be provided by the user, and the user can provide only a small amount of information, the synonyms can be expanded automatically using the hypernym information in the WordNet semantic dictionary. For example, for "< solution > this query", "this query" is used as the context vector, and for expansion the hypernym "difficulty" of "query" is also added to the context vector, forming a new query term.
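As a rough illustration of the category mode, the following Python sketch parses a query of the form `<core> hint1, hint2` and expands the hint words with their hypernyms. The `HYPERNYMS` table is a hypothetical stand-in for a WordNet lookup, and the example words in it are invented for the sketch rather than taken from the source.

```python
import re

# Hypothetical stand-in for a WordNet hypernym lookup.
HYPERNYMS = {"problem": ["difficulty"], "issue": ["difficulty"]}


def parse_category_query(query):
    """Split '<core> hint1, hint2' into the core word and its category hints."""
    match = re.match(r"\s*<\s*([^<>]+?)\s*>\s*(.*)", query)
    if not match:
        raise ValueError("not a category-mode query: %r" % query)
    core, rest = match.groups()
    hints = [h.strip() for h in rest.split(",") if h.strip()]
    return core, hints


def expand_context(hints, hypernyms=HYPERNYMS):
    """Append the hypernyms of each hint to the context vector, without duplicates."""
    vector = list(hints)
    for hint in hints:
        for upper in hypernyms.get(hint, []):
            if upper not in vector:
                vector.append(upper)
    return vector
```

A part-of-speech hint such as `n.` passes through `expand_context` unchanged because it has no entry in the hypernym table.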
In step S303, when the input mode is the target language collocation mode, the input collocations are translated and expanded to form new query terms. In one embodiment, a user who inputs "solution question" wants to find usage knowledge of "solution", and the relevant web pages are obtained by constraining the query with the Chinese collocation information. In this mode, the Chinese part is first translated according to the Chinese-English knowledge base. Because the translation options offered by a general Chinese-English dictionary are limited and cannot satisfy the need for semantic expansion of the Chinese, synonym expansion is used to solve this problem. Therefore, after the Chinese part is translated, the synonyms are expanded to form as many feature word vectors as possible. In addition, the query term can be further expanded with the WordNet semantic dictionary, using words that stand in a hypernym relation in WordNet.
In step S304, when the input mode is a single word mode, a query term is formed from the single word.
In step S305, web page information is retrieved according to the generated query term, and a web page related to the input word or phrase is acquired.
Step S40, analyzing the web page obtained by the search, and converting the web page into a candidate text. In one embodiment, the specific process of step S40 is: analyzing the searched webpage information, removing repeated webpages, and analyzing each webpage into a form of a document model tree; and in the document model tree, removing non-text labels in the webpage, reserving useful labels, and converting the webpage into candidate texts in a text form.
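A minimal Python sketch of step S40 is given below, using only the standard library. It removes exact-duplicate pages and keeps visible text while skipping the contents of non-text tags. The event-based `HTMLParser` stands in for the document model tree, and the set of skipped tags is an assumption; a real system would likely detect near-duplicates and choose the retained tags more carefully.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of non-text tags."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


def pages_to_texts(pages):
    """Drop exact-duplicate pages, then convert each page to plain text."""
    texts = []
    for html in dict.fromkeys(pages):   # keeps order, removes duplicates
        parser = TextExtractor()
        parser.feed(html)
        parser.close()
        texts.append(" ".join(parser.chunks))
    return texts
```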
Step S50, processing the candidate text, and extracting the context information and typical example sentences of the word or phrase. In one embodiment, step S50 comprises two parts: extracting the context information of the word or phrase, and extracting the typical example sentences of the word or phrase. The context information is extracted as follows: the candidate text is processed into single sentences through boundary identification, the candidate words in the single sentences are obtained through keyword search, each candidate text is counted with a statistical algorithm to obtain the occurrence frequency of the candidate words, and a candidate list of context information is output according to the occurrence frequency of the candidate words. In this embodiment, the candidate words may further be sorted by occurrence frequency, a preset number of candidate words selected according to the sorting, and the functional words and non-semantic words removed according to the stop word list, so as to obtain a candidate list containing the context information of the selected candidate words.
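The context-information extraction can be sketched in Python as follows: split the text into sentences, keep the sentences containing the keyword, count the co-occurring words, take the most frequent ones, and then drop stop words. The regular expressions used for sentence and word boundaries and the stop word list are simplifying assumptions.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}  # hypothetical list


def context_candidates(texts, keyword, top_n=10):
    """Rank words that co-occur with `keyword` in the same sentence."""
    counts = Counter()
    for text in texts:
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            words = re.findall(r"[a-z']+", sentence.lower())
            if keyword in words:
                counts.update(w for w in words if w != keyword)
    selected = [w for w, _ in counts.most_common(top_n)]   # top-N by frequency
    return [w for w in selected if w not in STOP_WORDS]    # then remove stop words
```

Selecting the top candidates before removing stop words follows the order given in the text; filtering first would be an equally plausible design and would return more content words.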
In one embodiment, as shown in fig. 7, the process of extracting the typical example sentence is specifically as follows:
In step S501, sentences containing the context information are extracted from the single sentences as candidate example sentences. In one embodiment, the specific process of step S501 is as follows: the candidate texts of the web pages are analyzed into single sentences. Specifically, a document may be divided into individual sentences at sentence-ending punctuation marks (e.g., "." and the like); to distinguish a sentence-ending period from the dot that follows an abbreviation, a list of abbreviations may be constructed and rules may be specified to determine whether a dot is a period. In addition, the length of the resulting single sentences can be limited; for example, a sentence containing more than 5 words and fewer than 30 words is used as a candidate example sentence.
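The sentence splitting with an abbreviation list and the length filter can be illustrated with the following Python sketch. The abbreviation list is a hypothetical sample, and the rule used here (a token ending in a period is a sentence end unless it is a listed abbreviation) is only one simple instance of the kind of rule the text mentions.

```python
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}  # hypothetical list


def split_sentences(text):
    """Split on '.', '!' or '?' unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def candidate_examples(text, context_words, min_len=5, max_len=30):
    """Keep sentences longer than min_len and shorter than max_len words
    that contain at least one context word."""
    keep = []
    for sentence in split_sentences(text):
        words = [w.strip(".,;:!?\"'").lower() for w in sentence.split()]
        if min_len < len(words) < max_len and any(c in words for c in context_words):
            keep.append(sentence)
    return keep
```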
In step S502, the candidate example sentences are clustered using a feature-based sentence clustering method. In one embodiment, as shown in fig. 8, the specific process of step S502 is as follows:
(1) Initialization. All the candidate example sentences obtained above are taken as data segment samples, and the matching distance $d(O_i, O_j)$ between every two data segment samples is calculated by the feature-distance-based method, thus forming a distance matrix; when a distance is needed later, it can be obtained directly from the distance matrix by table lookup.
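Precomputing the distance matrix can be sketched in a few lines of Python. The sketch assumes that the distance is symmetric and that the distance from a sample to itself is zero; `dist` is any pairwise distance function supplied by the caller.

```python
def build_distance_matrix(samples, dist):
    """Precompute a symmetric matrix so later steps use lookups, not recomputation."""
    n = len(samples)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dist(samples[i], samples[j])
    return matrix
```

Later steps can then read `matrix[i][j]` instead of recomputing the sentence distance, at the cost of storing one value per pair of samples.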
The principal-component-based feature distance calculation reduces a sentence S to its main components by means of the stop word list: words in the stop word list are removed, different word forms are restored to their base form, and a synonym dictionary is used to remove words of the same semantic class within the sentence, so that each sentence is represented by semantically unrelated features. This is similar to principal component analysis in pattern recognition. Let the two sentences after analysis be expressed as $O_1 = w_1 w_2 \ldots w_m$ and $O_2 = w_1 w_2 \ldots w_n$, respectively. The distance between them is defined as:
Figure GSA00000080931000141
wherein,
Figure GSA00000080931000142
represents the semantic similarity between two words: it is defined as 1 if the two words are semantically similar or identical, and as 0 otherwise; m and n represent the numbers of main words in the two sentences $O_1$ and $O_2$, respectively.
The expected number of clusters C, the inter-class distance threshold $\theta_C$ for class merging, the minimum number of samples in each class $\theta_N$, and the maximum number of iterations $t_{\max}$ are set in advance; c denotes the number of classes and t denotes the number of iterations.
(2) Initializing cluster centers
Sentences containing relatively many words are selected from c web pages of different sources as the initial cluster centers. A threshold on the number of candidate words contained in a sentence used as an initial cluster center may be set in advance; when the number of candidate words in a sentence reaches the threshold, that sentence serves as an initial cluster center.
(3) Sample classification
The data segment samples are assigned to the classes according to the minimum-distance principle, and the number of samples in each class is recorded. For any sample O in the space of all sentences, if
Figure GSA00000080931000143
then $O \in \Gamma_j$, where $m(\Gamma_j)$ denotes the cluster center of class $\Gamma_j$, j is the class index, and $\Gamma_j$ is the sample space of the j-th class. The number of samples in each class is checked at the same time: if it is less than $\theta_N$, the class is discarded, c = c - 1, and the samples of that class are reassigned to the remaining classes.
(4) Recalculating cluster centers
Recalculate the cluster center $m(\Gamma_j)$ of each class, j = 1, 2, …, c. The cluster center is calculated as follows:
Find the pseudo center O′, which is the element of $\Gamma_j$ with the largest number of elements whose distance to it is less than a certain threshold. Let $\bar{d}$ and $\sigma_d^2$ be the mean and variance of $d(O_k, O_l)$, respectively, where $O_k, O_l \in \Gamma_j$; then:
$$\bar{d} = \frac{2}{N_j (N_j - 1)} \sum_{k=1}^{N_j - 1} \sum_{l=k+1}^{N_j} d(O_k, O_l)$$
$$\sigma_d^2 = \frac{2}{N_j (N_j - 1)} \sum_{k=1}^{N_j - 1} \sum_{l=k+1}^{N_j} d^2(O_k, O_l) - \bar{d}^2$$
wherein, the threshold is defined as follows:
Figure GSA00000080931000147
If only one element satisfies the above condition, that sample is taken as the pseudo center; if two or more elements satisfy it simultaneously, all samples in $\Gamma_j$ whose matching distance is less than the threshold are taken as a subclass of the class, the average intra-class distance between each element of the subclass and the other elements is calculated, and the element with the smallest average intra-class distance is selected as the pseudo center. The pseudo center obtained in this way is the sample closest to the actual cluster center and can be used in its place.
(5) If this is an even-numbered iteration or c ≥ 2C, go to step (8); otherwise continue.
(6) Calculating intra-class distance
Calculate, for $\Gamma_j$, the overall intra-class distance $\lambda_j$ and the average intra-class distance
Figure GSA00000080931000151
Figure GSA00000080931000152
Figure GSA00000080931000153
j=1,2,…,c。
(7) Class splitting
The class with the largest intra-class distance is split into two classes. The largest intra-class distance may be chosen in two ways: by the overall intra-class distance or by the average intra-class distance. Let the selected class be $\Gamma_{j\max}$. If $|\Gamma_{j\max}| \geq 2\theta_N$ or $c \leq C/2$, $\Gamma_{j\max}$ is split as follows: find two samples $O_{p1}$ and $O_{p2}$ in the class such that $d(O_{p1}, O_{p2}) \geq d(O_{p3}, O_{p4})$ for any sample pair $O_{p3}$, $O_{p4}$ in the class; $O_{p1}$ and $O_{p2}$ then replace the original cluster center as two new cluster centers, c = c + 1, and the procedure goes to step (9).
(8) Calculating inter-class distance
Calculate the distance between every pair of cluster centers using the principal-component-based feature distance calculation: $d(m(\Gamma_i), m(\Gamma_j))$, $1 \leq i, j \leq c$.
(9) Class merging
Among all $d(m(\Gamma_i), m(\Gamma_j))$, find the minimum value $d(m(\Gamma_p), m(\Gamma_q))$. If $d(m(\Gamma_p), m(\Gamma_q)) < \theta_C$, merge class $\Gamma_p$ and class $\Gamma_q$, calculate the new cluster center using step (4), and let c = c - 1.
(10) Let t = t + 1. If $t < t_{\max}$, go to step (3); otherwise store the data related to the clustering (the number of clusters c, the cluster centers, and the sample closest to each cluster center, i.e., the pseudo center) and end.
In step S503, the sentences serving as cluster centers are selected from the clustered sentences as typical example sentences. Specifically, the sentences at the actual cluster centers and the pseudo centers closest to the actual cluster centers may be selected as typical example sentences.
Step S60, outputting the context information and the typical example sentences.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the patent. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, and these all fall within the scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (16)

1. A system for obtaining word usage knowledge based on data mining, the system comprising:
the input device is used for inputting words or phrases to be searched;
the query analysis device is used for analyzing the keywords in the words or phrases to be searched and sending the words or phrases to be searched into the corresponding input mode processing device for processing according to the analysis result;
the multi-input mode processing device analyzes and expands the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searches webpage information according to the query terms to obtain webpages related to the words or phrases to be searched;
the webpage analysis device is used for analyzing the searched webpage and converting the webpage into a candidate text;
the usage knowledge extraction device is used for processing the candidate text and extracting the context information and the typical example sentence of the word or phrase to be searched;
and the output device outputs the context information and the typical example sentences.
2. The system for obtaining knowledge of word usage based on data mining of claim 1, wherein the multiple input mode processing means includes a plurality of input mode units: the system comprises a comparison mode unit, a category mode unit, a target language collocation mode unit and a single word mode unit, and also comprises a search engine retrieval module for retrieving a webpage;
the comparison mode unit adopts logic words to combine words or phrases into query terms, the category mode unit analyzes and expands input central words and category information to form query terms, the target language collocation mode unit translates and expands input collocation words to form query terms, the single word mode unit forms query terms according to the input single words, and the search engine retrieval module retrieves webpage information according to the query terms to acquire webpages related to the input words or phrases.
3. The system for obtaining word usage knowledge based on data mining as claimed in claim 1, wherein the web page analysis means further analyzes the searched web page information to remove duplicate web pages, analyzes each web page into a document model tree in which non-text tags in the web page are removed and useful tags are retained, thereby converting the web page into candidate text in text form.
4. The system for obtaining word usage knowledge based on data mining as claimed in claim 2 or 3, wherein the usage knowledge extraction means comprises:
and the context information extraction unit is used for processing the candidate text into a single sentence through boundary identification, acquiring candidate words in the single sentence through keyword search, counting each candidate text by using a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputting a candidate list of context information according to the occurrence frequency of the candidate words.
5. The system for obtaining word usage knowledge based on data mining of claim 4, wherein the context information extraction unit is further configured to sort the candidate words according to the occurrence frequency of the candidate words, select a preset number of candidate words according to the sorting, and remove functional words and non-semantic words according to a stop word list to obtain a candidate list containing context information of the selected candidate words.
6. The system for acquiring word usage knowledge based on data mining as claimed in claim 4, wherein the usage knowledge extraction device further comprises a typical example sentence extraction unit, the typical example sentence extraction unit comprising:
the candidate example sentence extraction module is used for extracting sentences containing the context information in the webpage candidate texts as candidate example sentences;
the clustering module is used for clustering the candidate example sentences by utilizing a sentence clustering method based on characteristics;
and the typical example sentence extraction module selects a sentence which is taken as a clustering center from the clustered sentences as a typical example sentence.
7. A method for acquiring word usage knowledge based on data mining comprises the following steps:
A. receiving a word or phrase to be searched input by a user;
B. analyzing the keywords in the words or phrases to be searched, and sending the words or phrases to be searched into a corresponding input mode for processing according to the analysis result;
C. analyzing and expanding the words or phrases to be searched by utilizing semantic knowledge and a dictionary to form query terms, and searching webpage information according to the query terms to obtain webpages related to the input words or phrases;
D. analyzing the searched web page, and converting the web page into a candidate text;
E. processing the candidate text, and extracting context information and typical example sentences of the words or phrases;
F. and outputting the context information and the typical example sentence.
8. The method for obtaining knowledge of word usage based on data mining of claim 7, wherein the input patterns include one or more of the following: a comparison mode, a category mode, a target language collocation mode and a single word mode.
9. The method for obtaining knowledge of word usage based on data mining of claim 8, wherein the input pattern is a comparison pattern, and the step C is specifically: and combining the words or phrases into a query term by adopting the logic words, retrieving webpage information according to the query term, and acquiring the webpage related to the input words or phrases.
10. The method for obtaining knowledge of word usage based on data mining of claim 8, wherein the input pattern is a category pattern, and the step C is specifically: analyzing and expanding the input central word and category information according to semantic knowledge to form a query term, retrieving webpage information according to the query term, and acquiring a webpage related to the input word or phrase.
11. The method for obtaining knowledge of word usage based on data mining of claim 8, wherein the input pattern is a target language collocation pattern, and the step C specifically comprises: analyzing and expanding the input collocation words according to the dictionary to form query terms, retrieving webpage information according to the query terms, and acquiring webpages related to the input words or phrases.
12. The method for obtaining knowledge of word usage based on data mining of claim 8, wherein the input pattern is a single word pattern, and the step C is specifically: and forming a query term according to the input single word, retrieving webpage information according to the query term, and acquiring a webpage related to the input word or phrase.
13. The method for obtaining knowledge of word usage based on data mining of claim 7, wherein the step D is specifically:
analyzing the searched webpage information, removing repeated webpages, and analyzing each webpage into a form of a document model tree;
and in the document model tree, removing non-text labels in the webpage, reserving useful labels, and converting the webpage into candidate texts in a text form.
14. The method for obtaining knowledge of word usage based on data mining as claimed in claim 13, wherein said step E comprises:
processing the candidate text into a single sentence through boundary identification, obtaining candidate words in the single sentence through keyword search, counting each candidate text by utilizing a statistical algorithm to obtain the occurrence frequency of the candidate words, and outputting a candidate list of context information according to the occurrence frequency of the candidate words.
15. The method for obtaining knowledge of word usage based on data mining of claim 14, wherein said step E further comprises:
and sorting the candidate words according to the occurrence frequency of the candidate words, selecting a preset number of candidate words according to the sorting, and removing functional words and non-semantic words according to a stop word list to obtain a candidate list containing the context information of the selected candidate words.
16. The method for obtaining knowledge of word usage based on data mining of claim 14, wherein said step E further comprises:
extracting sentences containing the context information from the single sentence as candidate example sentences;
clustering the candidate example sentences by using a sentence clustering method based on characteristics;
and selecting a sentence as a clustering center from the clustered sentences as a typical example sentence.
CN 201010147993 2010-04-09 2010-04-09 Data mining-based word usage knowledge acquisition system and method Active CN102214189B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010147993 CN102214189B (en) 2010-04-09 2010-04-09 Data mining-based word usage knowledge acquisition system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010147993 CN102214189B (en) 2010-04-09 2010-04-09 Data mining-based word usage knowledge acquisition system and method

Publications (2)

Publication Number Publication Date
CN102214189A true CN102214189A (en) 2011-10-12
CN102214189B CN102214189B (en) 2013-04-24

Family

ID=44745504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010147993 Active CN102214189B (en) 2010-04-09 2010-04-09 Data mining-based word usage knowledge acquisition system and method

Country Status (1)

Country Link
CN (1) CN102214189B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103176967A (en) * 2011-12-23 2013-06-26 英顺源(上海)科技有限公司 Translation inquiring system and translation inquiring method based on a plurality of inquiring words
CN103678287A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Method for unifying keyword translation
CN103678407A (en) * 2012-09-24 2014-03-26 富士通株式会社 Data processing method and data processing device
CN105955993A (en) * 2016-04-19 2016-09-21 北京百度网讯科技有限公司 Method and device for sequencing search results
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
WO2018161516A1 (en) * 2017-03-07 2018-09-13 京东方科技集团股份有限公司 Method and device for automatic discovery of medical knowledge
CN108628821A (en) * 2017-03-21 2018-10-09 腾讯科技(深圳)有限公司 A kind of vocabulary mining method and device
CN109213777A (en) * 2017-06-29 2019-01-15 杭州九阳小家电有限公司 A kind of voice-based recipe processing method and system
CN110569335A (en) * 2018-03-23 2019-12-13 百度在线网络技术(北京)有限公司 triple verification method and device based on artificial intelligence and storage medium
CN114860872A (en) * 2022-04-13 2022-08-05 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000020520A (en) * 1998-07-07 2000-01-21 Keiichi Kato Method and system for language analysis and recognition processing, storage medium having stored language analysis and recognition processing program thereon and storage medium having recorded data group prepared by the method thereon
CN101246492A (en) * 2008-02-26 2008-08-20 华中科技大学 Full text retrieval system based on natural language
CN101436198A (en) * 2008-12-12 2009-05-20 腾讯科技(深圳)有限公司 Method and device for improving search accuracy rate

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103176967A (en) * 2011-12-23 2013-06-26 英顺源(上海)科技有限公司 Translation inquiring system and translation inquiring method based on a plurality of inquiring words
CN103678407A (en) * 2012-09-24 2014-03-26 富士通株式会社 Data processing method and data processing device
CN103678287A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Method for unifying keyword translation
CN103678287B (en) * 2013-11-30 2016-12-07 语联网(武汉)信息技术有限公司 A kind of method that keyword is unified
CN105955993A (en) * 2016-04-19 2016-09-21 北京百度网讯科技有限公司 Method and device for sequencing search results
CN107515877A (en) * 2016-06-16 2017-12-26 百度在线网络技术(北京)有限公司 The generation method and device of sensitive theme word set
WO2018161516A1 (en) * 2017-03-07 2018-09-13 京东方科技集团股份有限公司 Method and device for automatic discovery of medical knowledge
US11455546B2 (en) 2017-03-07 2022-09-27 Beijing Boe Technology Development Co., Ltd. Method and apparatus for automatically discovering medical knowledge
CN108628821A (en) * 2017-03-21 2018-10-09 腾讯科技(深圳)有限公司 A kind of vocabulary mining method and device
CN108628821B (en) * 2017-03-21 2022-11-25 腾讯科技(深圳)有限公司 Vocabulary mining method and device
CN109213777A (en) * 2017-06-29 2019-01-15 杭州九阳小家电有限公司 A kind of voice-based recipe processing method and system
CN110569335A (en) * 2018-03-23 2019-12-13 百度在线网络技术(北京)有限公司 triple verification method and device based on artificial intelligence and storage medium
US11275810B2 (en) 2018-03-23 2022-03-15 Baidu Online Network Technology (Beijing) Co., Ltd. Artificial intelligence-based triple checking method and apparatus, device and storage medium
CN114860872A (en) * 2022-04-13 2022-08-05 北京百度网讯科技有限公司 Data processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN102214189B (en) 2013-04-24

Similar Documents

Publication Publication Date Title
CN102214189B (en) Data mining-based word usage knowledge acquisition system and method
Gupta et al. A survey of text question answering techniques
US20150227505A1 (en) Word meaning relationship extraction device
US20110295857A1 (en) System and method for aligning and indexing multilingual documents
Imam et al. An ontology-based summarization system for arabic documents (ossad)
JP2011118689A (en) Retrieval method and system
Ahmed et al. Revised n-gram based automatic spelling correction tool to improve retrieval effectiveness
Eger et al. Lemmatization and Morphological Tagging in German and Latin: A Comparison and a Survey of the State-of-the-art
CN115794995A (en) Target answer obtaining method and related device, electronic equipment and storage medium
Jain et al. Context sensitive text summarization using k means clustering algorithm
Abdurakhmonova et al. Uzbek electronic corpus as a tool for linguistic analysis
Awajan Semantic similarity based approach for reducing Arabic texts dimensionality
Nehar et al. Rational kernels for Arabic root extraction and text classification
Hirpassa Information extraction system for Amharic text
CN110688559A (en) Retrieval method and device
Ahmed et al. Gold dataset for the evaluation of bangla stemmer
Kedtiwerasak et al. Thai keyword extraction using textrank algorithm
CN112949287A (en) Hot word mining method, system, computer device and storage medium
Bhaskar et al. Theme based English and Bengali ad-hoc monolingual information retrieval in fire 2010
Thanadechteemapat et al. Thai word segmentation for visualization of thai web sites
Baishya et al. Present state and future scope of Assamese text processing
Faisol et al. Sentiment analysis of yelp review
Abdullah et al. Feature-based POS tagging and sentence relevance for news multi-document summarization in Bahasa Indonesia
Chakraborty et al. Syntactic Category based Assamese Question Pattern Extraction using N-grams
Li et al. Concept unification of terms in different languages via web mining for Information Retrieval

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131015

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20131015

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Patentee after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Room 403, East, Building 2, SEG Science Park, Zhenxing Road, Futian District, Shenzhen, Guangdong Province 518044

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.