
CN112860781A - Mining and displaying method combining vocabulary collocation extraction and semantic classification - Google Patents

Mining and displaying method combining vocabulary collocation extraction and semantic classification

Info

Publication number: CN112860781A
Application number: CN202110162745.8A
Authority: CN (China)
Prior art keywords: collocation, word, classification, semantic, words
Other languages: Chinese (zh)
Inventors: 陈永朝, 陈永胜
Current Assignee: Individual (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Individual
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Priority date: 2021-02-05 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2021-02-05
Publication date: 2021-05-28
2021-02-05: Application filed by Individual; priority to CN202110162745.8A
2021-05-28: Publication of CN112860781A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/26 Visual data mining; Browsing structured data
    • G06F18/24 Classification techniques
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/253 Grammatical analysis; Style critique
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a mining and displaying method that combines word collocation extraction with semantic classification, belonging to the fields of natural language processing and language-learning applications. The method comprises the following steps: extracting and filtering word collocations with a deep-learning-based dependency syntactic-semantic model combined with statistical measures; computing a many-to-many collocation matrix from the organized word classifications and the extracted collocations; screening, sorting and recombining the collocation words and word classifications according to collocation density; displaying the resulting word-classification collocation matrix in card form as m × n or m × n × o; and presenting the cards in the hierarchical order of the lexical semantic classification. The method yields a collocation and vocabulary classification system of more general significance and a many-to-many word-classification collocation matrix display that is intuitive and vivid, has high information density, and facilitates systematic vocabulary learning through grouping that combines semantics and usage.

Description

Mining and displaying method combining vocabulary collocation extraction and semantic classification
Technical Field
The invention relates to a method for producing resources, books and software in the fields of natural language processing and language-learning applications. In particular, it relates to a collocation extraction and screening method based on deep-learning dependency syntactic-semantic analysis combined with statistical measures; to a mutual-verification algorithm between word classification and collocation; and to a many-to-many linked card display of word classifications and collocations, in cards and card books (paper books, electronic books and language-learning software).
Background
Collocation extraction is a classical topic in natural language processing. Word collocation refers to the association between words; this linguistic phenomenon is a typical co-occurrence behavior among words (Firth, 1957, Papers in Linguistics 1934-1951. London: Oxford University Press). The typicality of word collocations rests on their probabilistic nature: any collocation is possible, but only some are more relevant than others (Sinclair, 1966, Beginning the Study of Lexis. London: Longman). Extracting "more appropriate", typical collocations has therefore become an important aspect of collocation research. In corpus-based collocation research, the main methods for automatically extracting typical collocations from probabilistic information are: counting the co-occurrence frequency of collocate and node words; measuring the MI value (mutual information) between co-occurring terms; and measuring the T value (t-test) between co-occurring terms. Each method can yield a great number of collocations, but whether a given collocation is a qualified one remains disputed, there is no well-operationalized objective standard, and the extraction quality and the sorting and screening criteria are difficult to improve further. Which extracted collocations can be used effectively for language learning is likewise a difficult problem.
Collocation dictionaries are a common tool for language learning. However, a conventional collocation dictionary usually lists collocations entry by entry. The problems with this are, first, that the dictionary becomes very thick and readers can hardly work through it from beginning to end; second, that readers can hardly tell which collocations are the most typical for a word, or form effective grouped associative memory from the fact that words sharing a word sense share similar collocations.
Semantic classification dictionaries, synonym classification dictionaries and synonym discrimination dictionaries are also common language-learning tools. However, most conventional semantic classification or synonym dictionaries are merely lists of synonyms, antonyms and related words, and a synonym discrimination dictionary typically shows the similarities and differences of synonyms only through a few example sentences or direct explanation. This approach often fails to fully characterize differences in usage; it is not intuitive; and word classifications (synonyms, near-synonyms, antonyms and the like) produced by simple manual induction can hardly keep up with the development and change of language usage.
Studies combining word collocation with semantic classification resources also mostly use semantic resources to describe the collocations under a single entry of a collocation vocabulary in semantic groups; for example, the likely collocates of a word such as "eat" fall into a "food" semantic class, i.e. "one word + a group of semantic-class words", which is an explanatory kind of study. Few studies exploit the collocational behavior between one group of semantic-class words and another to verify and optimize, in turn, the rationality of the word classification and the generality of the collocations; such studies belong to quantitative research or knowledge-mining exploration.
To address the difficulties and shortcomings of the traditional problems and methods, this application combines deep-learning syntactic-semantic analysis with statistical methods to improve the quality of collocation extraction and screening; it screens, sorts and optimizes collocations and the vocabulary classification system through iterative mutual verification of word classification and collocation; and it displays word classifications and collocations as many-to-many collocation matrix (mapping) cards.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a method for vocabulary collocation extraction and semantic classification that solves, from the perspectives of syntax, semantics and pragmatics, the problem of optimally selecting general collocations and word classifications. Combining vocabulary collocation with semantic classification resources, the method uses the quantitative characteristics computed from lexical semantic classification and collocation to explore collocation and semantic knowledge mining.
In order to solve the technical problem, the invention adopts the following technical scheme. A mining method combining vocabulary collocation extraction and semantic classification comprises the following steps: step A: vocabulary collocation extraction; step B: part-of-speech-aware lexical semantic classification; step C: computing collocations against the word classifications to form a many-to-many collocation mapping, sorting the collocation words and word classifications by collocation density, and then splitting and recombining the word classifications; the steps are repeated until the word classifications have no new splits or recombinations.
Preferably, in the above technical solution, in the step a, the method further includes the following steps:
step A1: marking the raw corpus by a syntactic and semantic dependency model based on deep learning, outputting a syntactic dependency tree or a semantic dependency graph, and extracting word collocation from the syntactic dependency tree or the semantic dependency graph;
step A2: filtering unqualified word collocation according to the mutual information, dependency distance and frequency of each word collocation based on the result of the syntactic dependency tree or the semantic dependency graph;
Preferably, in the above technical solution, before step A1, the syntactic-semantic dependency model needs to be trained on the annotated corpus; during training, either the syntactic dependency tree or the semantic dependency graph model may be selected, and one or more syntactic dependency tree or semantic dependency graph models may be trained simultaneously.
Preferably, in the above technical solution, in step A2, the frequency CF of each word collocation and the average dependency distance DD of each word collocation are counted, and the pointwise mutual information PMI of each word collocation is calculated:

PMI(w1, w2) = log( P(w1, w2) / (P(w1) · P(w2)) )

and a screening threshold V_collocation is set:

V_collocation = θ1 · PMI + θ2 · CF + θ3 · (1 / DD)

wherein the parameter criteria of the screening threshold V_collocation are: λ1 ≤ 80, keeping only word collocations whose frequency ranks in the top 80%; λ2 ≤ 80, keeping only word collocations whose mutual information ranks in the top 80%; λ3 ≥ 10, discarding word collocations whose dependency distance is 10 or greater; θ1, θ2 and θ3 are set to appropriate values according to the value ranges of the specific PMI, CF and DD;

word collocations that do not meet the parameter criteria of the screening threshold V_collocation are discarded, and those that meet them are then filtered according to V_collocation.
Preferably, in the above technical solution, in the step B, the method further includes the following steps:
b1: classifying the vocabulary according to the part of speech to obtain part of speech classification;
b2: dividing the vocabulary within each part-of-speech classification into subclasses by semantic relevance, using a bottom-up method: in the initial state each word is an independent subclass, two subclasses are merged into a new subclass when their words are synonyms, near-synonyms, antonyms or otherwise similar words, all subclasses are traversed until no new merges occur, and subclass names are defined where necessary;
b3: grouping the vocabulary within each part-of-speech classification into major classes by semantic relevance, using a top-down method: abstract semantic major classes are drawn up for each part of speech, the subclasses are then assigned under the major classes, major classes are added or modified if they cannot fully cover the subclasses, and finally all subclasses are assigned into major classes; the assigned subclasses are the word classifications.
Preferably, in the above technical solution, in the step C, the method further includes the following steps:
c1: each word classification obtains a series of one-to-many collocations from the word collocations:

W1: [C1, C2, …, Cn], W2: [C1, C2, …, Cn], …, Wm: [C1, C2, …, Cn]

c2: obtaining a many-to-many collocation matrix by combining the word-classification words with their corresponding word collocations:

[W1, W2, …, Wm] × [C1, C2, …, Cn]^T = [W1: C1, C2, …, Cn] ∩ [W2: C1, C2, …, Cn] ∩ … ∩ [Wm: C1, C2, …, Cn];
c3: sorting the collocation words in each collocation matrix according to the collocation density thereof, and sorting the word classification in each collocation matrix according to the collocation density thereof;
c4: splitting or recombining the word classification according to the collocation density of the collocation matrix;
the collocation density of a word classification is the actual number of links between the word classification and its collocation words divided by the full number of links; the collocation density of a collocation word is the actual number of links between the collocation word and the word classification divided by the full number of links; the collocation density of a collocation matrix is the actual number of links in the collocation matrix divided by the full number of links in the collocation matrix;
c5: the C1, C2, C3, and C4 steps are repeated until the word class has no new splits and recombinations.
Preferably, in the above technical solution, in the step C4, the method further includes the following steps:
c4-1: if, after the classification's words are sorted, the collocation density of some entries in the collocation matrix is below a threshold β1, splitting those entries into a separate collocation matrix;

c4-2: if some words in the collocation matrix share similar collocations but have fewer than a threshold β2 collocations with the other words, splitting those words into a separate collocation matrix;

c4-3: if two collocation matrices are semantically related or similar subclasses, their collocation words are similar, and the collocation density of the merged matrix reaches a threshold β3, merging the two word classifications.
A display method combining word collocation extraction and semantic classification, characterized in that the collocation and word-classification mapping obtained by the word collocation extraction and semantic classification method is displayed in a card structure of m × n or m × n × o, where n and o are collocation words and m is the words within a category. The cards are displayed in the hierarchical order of the lexical semantic classification system.
Preferably, in the above technical solution, the display mode is one or more of card display, book display, electronic card display, electronic book display, and electronic software display.
Preferably, in the above technical solution, if the mapping to be displayed exceeds the displayable range, the mapping is truncated according to the ordering of the collocation words.
First, this application needs to extract word collocations. The main collocations are verb + noun, noun + verb, noun + noun, adjective + noun, adverb + adjective and adverb + verb, and dependency syntactic and dependency semantic analysis reveal exactly these syntactic-semantic relations between words. Syntactic dependency analysis has different annotation schemes, but in general there are about 10-30 distinct syntactic relations between words, for example the verb-object relation between verbs and nouns, the subject-predicate relation between nouns and verbs, the modification relation between adjectives and nouns, between adverbs and verbs, and between adverbs and adjectives. Semantic dependencies subdivide the relations between words further; one relation may, for example, be subdivided into action, subject, scope, reason and the like, another into category, member, attribute, material and the like. In short, dependency syntactic and semantic analysis directly reveals word-to-word relations, and at a finer grain; moreover, the recent breakthroughs of deep learning in the accuracy and speed of dependency syntactic and semantic analysis make large-scale corpus processing feasible. We therefore train a dependency syntactic-semantic analysis model to process large-scale corpora, providing collocation candidates that are more numerous, more finely typed and more linguistically interpretable. In a specific implementation, the input to dependency syntactic-semantic analysis is a sentence, the output is a dependency tree or dependency graph over part-of-speech-tagged words, and the word-to-word relations in that tree or graph are the word collocations to be extracted.
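As a concrete illustration of this step, below is a minimal sketch of collocation extraction from dependency parses. It assumes sentences arrive as CoNLL-U-style token dictionaries; the POS tags, relation labels and field names are illustrative choices, not the patent's actual annotation scheme.

```python
# A minimal sketch of collocation extraction from dependency parses.
# Assumes each sentence is a list of token dicts with 1-based ids, as a
# CoNLL-U reader would produce; POS tags and relation labels are illustrative.

from collections import Counter

# The six collocation patterns as (head POS, dependent POS, dependency relation).
PATTERNS = {
    ("VERB", "NOUN", "obj"),     # verb + noun (verb-object)
    ("VERB", "NOUN", "nsubj"),   # noun + verb (subject-predicate)
    ("NOUN", "NOUN", "nmod"),    # noun + noun
    ("NOUN", "ADJ", "amod"),     # adjective + noun
    ("VERB", "ADV", "advmod"),   # adverb + verb
    ("ADJ", "ADV", "advmod"),    # adverb + adjective
}

def extract_collocations(sentences):
    """Count candidate collocations and accumulate their dependency distances."""
    freq = Counter()
    dist_sum = Counter()
    for tokens in sentences:
        by_id = {t["id"]: t for t in tokens}
        for tok in tokens:
            head = by_id.get(tok["head"])  # head id 0 is the root, not a word
            if head is None:
                continue
            if (head["upos"], tok["upos"], tok["deprel"]) in PATTERNS:
                pair = (head["lemma"], tok["lemma"], tok["deprel"])
                freq[pair] += 1
                dist_sum[pair] += abs(head["id"] - tok["id"])
    avg_dd = {p: dist_sum[p] / freq[p] for p in freq}  # average dependency distance
    return freq, avg_dd
```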
Second, this application needs to filter the word collocations. The extracted collocations are filtered with several mathematical-statistical measures, including mutual information, frequency and dependency distance. Running the statistics on top of the dependency-model results has two benefits. In terms of precision, the statistics screen collocations from additional angles and evaluation scales; in terms of efficiency, without dependency analysis as a first pass the statistical computation would be far larger, and even with simple word filtering as a first pass both efficiency and precision would be much lower. In a specific implementation, from the dependency analysis results of the previous step we count the frequency of each word, the frequency of each collocation, the total number of words, and the distance between the two words of each collocation, i.e. the dependency distance. These data are substituted into the mutual information formula to obtain each collocation's mutual information value, yielding the frequency, dependency distance and mutual information of every collocation. All collocations are then ranked by a composite calculation under set thresholds, and collocations with low composite values are filtered out.
Third, this application needs to produce lexical semantic classifications. For word classification we first distinguish parts of speech, grouping words by the main parts of speech: verbs, nouns, adjectives and adverbs. On this basis, each part of speech is divided into major classes and subclasses by semantics through a combination of bottom-up and top-down methods; the word classifications are the subclasses, formed by grouping synonyms, near-synonyms, antonyms and similar words.
To further screen out collocations and word classifications with more general significance for learning purposes, this application adopts iterative mutual verification between collocation and word classification. First, we define the collocation of one word class with another word class as a collocation with general significance, i.e. an m × n collocation matrix such as [verbs like "eat"] + [food nouns]. The higher the collocation density (the number of collocation links between [verbs like "eat"] and [food nouns]), the more general the collocation; if the word classification and the collocation words are fully connected, the m × n collocation matrix (mapping) reaches the maximum collocation density (m × n)/(m × n) = 1. Second, we define word classifications whose members share grammar, semantics and pragmatics as classifications with general significance; again by the collocation density of the m × n matrix (mapping), the larger the density, the more general the word classification is as a group. In a specific implementation, the collocation matrix is computed from the word classifications and collocations, and reordering, screening, splitting and merging the word classifications then yields higher collocation densities and hence more general collocations and word classifications. This process is iterative. Finally, this step yields more reasonable word classifications, under which the corresponding collocations are sorted and screened.
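The collocation density at the heart of this iteration can be sketched as follows, assuming a collocation matrix is represented simply as the set of observed (word, collocate) links; the vocabulary in the example is invented.

```python
# A minimal sketch of collocation density for an m x n collocation matrix.

def matrix_density(words, collocates, links):
    """Density = actual number of links / full number of links."""
    full = len(words) * len(collocates)
    actual = sum(1 for w in words for c in collocates if (w, c) in links)
    return actual / full if full else 0.0

# Example: [verbs like "eat"] x [food nouns]; full connection gives density 1.
words = ["eat", "devour", "taste"]
collocates = ["apple", "bread", "soup", "noodles"]
links = {(w, c) for w in words for c in collocates} - {("devour", "soup")}
print(matrix_density(words, collocates, links))  # 11/12 ≈ 0.92
```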
To display collocations and word classifications, we use many-to-many collocation matrix (mapping) cards, implemented with two or three columns. Taking nouns as an example, in a three-column layout the noun classification group occupies the middle column; the first column holds adjectives that modify the nouns or verbs that form a verb-object structure with them; the third column may hold nouns forming a noun-noun structure or verbs forming a subject-predicate structure. Taking adjectives as an example, in a two-column layout the adjective classification group occupies the first column, and the second column may hold nouns standing in the adjective-modification relation. Note that some languages order words differently from Chinese or English; in French, for instance, a modifying adjective generally follows the noun. In summary, there are three display modes: a three-column mode with the word classification in the middle column and collocation words in the front and back columns, and two two-column modes with the word classification in either the first or the second column and the collocation words in the other. As mentioned earlier, our word classification first groups by nouns, verbs, adjectives and adverbs; these are the main sentence constituents in most languages, accounting for over ninety-five percent of words, and the six main extracted collocation types are exactly combinations of these four parts of speech. Once the vocabulary of a language has been classified by part of speech, then into synonyms, near-synonyms, antonyms and similar words by semantics, and matched with its collocations into collocation matrix (mapping) cards, the whole vocabulary system forms a word-classification collocation card book. The books and cards may be paper or electronic, and may also be delivered as software.
Compared with traditional methods, extracting and filtering collocations by combining dependency analysis with statistical methods has two advantages: first, it makes full use of both the syntactic-semantic and the mathematical-statistical properties of collocations, improving extraction quality; second, the two-layer processing improves efficiency, so that larger corpora can be processed, more collocations and richer statistics obtained, and ultimately collocations that better reflect real usage extracted.
The method disclosed in this application solves, in a self-consistent way from the perspectives of syntax, semantics and pragmatics, the problem of optimally selecting general collocations and word classifications. Combining vocabulary collocation with semantic classification resources, and using the quantitative characteristics computed from them, it explores collocation and semantic knowledge mining and provides an objective basis for building collocation and semantic classification resources. Meanwhile, displaying collocations and word classifications as many-to-many linked collocation matrix cards is intuitive and vivid and has high information density; the semantically grouped linking mode, the learning and memory cards and test and evaluation cards derived from it, and the card books composed of such cards, in the form of paper books, electronic books and language-learning software, all suit the characteristics of vocabulary-learning books and software based on this method.
Drawings
FIG. 1 is an example of a syntactic dependency tree.
FIG. 2 is an example of a semantic dependency graph.
Fig. 3 shows an example of a word classification collocation matrix (mapping) for three columns.
Fig. 4 shows a word classification collocation matrix (mapping) example 1 for two columns.
Fig. 5 shows a word classification collocation matrix (mapping) example 2 for two columns.
FIG. 6 is a flow chart of the system of the present application.
Detailed Description
In order to make those skilled in the art better understand the technical solution of the present invention, the technical solution in the embodiment of the present invention will be clearly and completely described below with reference to the drawings in the embodiment of the present invention, and it is obvious that the described embodiment is only a part of the embodiment of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention. The present invention will be described with reference to the accompanying drawings and specific embodiments.
A method for vocabulary collocation extraction and semantic classification first trains a syntactic-semantic dependency model.

A syntactically annotated dependency tree (FIG. 1) or semantic dependency graph (FIG. 2) is obtained. Parts of speech are annotated as well; Chinese text is first word-segmented, and inflected languages such as English and French likewise undergo part-of-speech processing. Processing for any language follows its corpus annotation specification.
A syntactic-semantic dependency model based on deep learning is designed. The model performs dependency analysis based on a graph algorithm: a Biaffine function computes the dependency relation between words, and a Bilinear function computes the dependency type between words.
(a) Biaffine(x1, x2) = x1^T U x2 + W (x1 ⊕ x2) + b

(b) Bilinear(x1, x2) = x1^T U x2
If the dependency-relation output is a dependency tree, an argmax objective function is adopted; if it is a dependency graph, a sigmoid objective function is adopted. The dependency-type outputs all use argmax objective functions.
(c) y'_i = argmax_j s_arc(i, j) (dependency-tree output)

(d) y'_(i,j) = 1 if sigmoid(s_arc(i, j)) > 0.5, else 0 (dependency-graph output)

(e) y'_(i,j) = argmax_t s_type(i, j, t) (dependency-type output)
The specific implementation is as follows. The encoder of the neural network model adopts a BiLSTM structure; the input vector is the concatenation of each word's word embedding (Word-embedding) and POS embedding (POS-embedding), and the encoder outputs a hidden state h_t for each position of the sentence sequence. The decoder of the neural network model passes the encoder output h_t through an FNN (Feedforward Neural Network) to adjust the vector dimension and obtain the head-word vector; then, taking the current word of the sentence together with its previous and next words as features, it obtains the dependent-word vector through another FNN dimension adjustment. That is to say,
(f) h_head_i = FNN_head(h_i)

(g) h_dep_j = FNN_dep(h_(j-1) ⊕ h_j ⊕ h_(j+1))

(h) s_arc(i, j) = Biaffine(h_head_i, h_dep_j)

(i) s_type(i, j, t) = Bilinear_t(h_head_i, h_dep_j)
Substituting formulas (h) and (i) into the objective functions of formulas (c), (d) and (e) gives the overall framework of the neural network model.
If the final output is a syntactic dependency tree, a well-formed non-projective tree can be obtained with the Chu-Liu-Edmonds algorithm commonly used in dependency parsing, or a well-formed projective tree can be decoded with the Eisner algorithm. If the output is a semantic dependency graph, it can be output directly, or an additional algorithm such as an AD algorithm can be applied if the graph must satisfy the DAG (Directed Acyclic Graph) constraint. The syntactic-semantic dependency model is trained on the annotated corpus; either the syntactic dependency tree or the semantic dependency graph model may be selected. For better results, one or more syntactic dependency tree or semantic dependency graph models may also be trained simultaneously.
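To make the scoring concrete, the following numpy sketch mirrors formulas (f)-(i) and the decision rules (c)-(e), under the assumption that the BiLSTM encoder outputs are already available as a matrix H; all dimensions and random weights are illustrative, and tree decoding (Chu-Liu-Edmonds or Eisner) is omitted.

```python
# A minimal numpy sketch of the biaffine arc scorer and bilinear type scorer.
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 6, 8, 4                  # sentence length, hidden size, number of types

H = rng.normal(size=(n, d))        # encoder hidden states h_1..h_n (one per word)
W_head = rng.normal(size=(d, d))
W_dep = rng.normal(size=(d, d))

head = np.tanh(H @ W_head)         # (f) head-word vectors via an FNN layer
dep = np.tanh(H @ W_dep)           # (g) dependent-word vectors (context windowing omitted)

U_arc = rng.normal(size=(d, d))
w_arc = rng.normal(size=(d,))
# (h) biaffine arc score: s_arc[i, j] = head_i^T U dep_j + w^T dep_j
s_arc = head @ U_arc @ dep.T + dep @ w_arc

U_type = rng.normal(size=(k, d, d))
# (i) bilinear type score per label t: s_type[t, i, j] = head_i^T U_t dep_j
s_type = np.einsum("id,tde,je->tij", head, U_type, dep)

pred_heads = s_arc.argmax(axis=0)               # (c) argmax over heads (tree case)
pred_graph = 1 / (1 + np.exp(-s_arc)) > 0.5     # (d) sigmoid threshold (graph case)
pred_types = s_type.argmax(axis=0)              # (e) argmax over dependency types
print(pred_heads, pred_types.shape)
```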
After the syntactic-semantic dependency model is trained, large-scale raw corpora are obtained. To better cover the real use of the language, the corpus selection can span a variety of genres and domains. The corpora are then annotated by the syntactic-semantic dependency model to obtain new syntactic dependency trees or semantic dependency graphs, from which the six main word collocation types are extracted: verb + noun, adverb + verb, noun + verb, adverb + adjective, adjective + noun, and noun + noun.
After word collocation extraction, some unqualified collocations are filtered out using mathematical-statistical measures such as mutual information, dependency distance and frequency computed over the dependency analysis results. These measures include the frequency CF (collocation frequency) of each word collocation, the average dependency distance DD (dependency distance) of each word collocation, the frequency of each word, and so on. If one or more syntactic dependency tree or semantic dependency graph models are used simultaneously, a composite score can be formed from the collocation frequency, average dependency distance and other measures extracted by each model.
The pointwise mutual information PMI (Pointwise Mutual Information) of each collocation is calculated:

PMI(w1, w2) = log( P(w1, w2) / (P(w1) · P(w2)) )

and a composite score with a screening threshold is computed:

V_collocation = θ1 · PMI + θ2 · CF + θ3 · (1 / DD)
The screened collocations are sorted. Here λ1 and λ2 can be set to 80% and λ3 to 10, i.e. collocations whose PMI value and frequency rank in the top 80% and whose average dependency distance is below 10 are kept; θ1, θ2 and θ3 are then set within reasonable ranges so that the PMI value, frequency and average dependency distance each contribute reasonably to the composite value V_collocation of each collocation. Note that the reciprocal of the average dependency distance is used because the smaller the average dependency distance, the higher the collocation's score V_collocation should be. One or more of the parameters θ1, θ2, θ3 may be zero. Other statistics, such as a collocation's t-test value, may be added to the formula in the same way.
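A minimal sketch of this screening step, assuming the PMI, CF and DD of each collocation have already been computed as above; the θ weights and λ cutoffs below are the illustrative values just described, not fixed parameters of the method.

```python
# A minimal sketch of composite screening by V_collocation.
import math

def pmi(pair_count, w1_count, w2_count, total):
    """Pointwise mutual information of a collocation (w1, w2)."""
    return math.log2((pair_count / total) / ((w1_count / total) * (w2_count / total)))

def screen(stats, theta=(1.0, 0.5, 2.0), lam=(0.8, 0.8, 10.0)):
    """stats: dict mapping (w1, w2) -> (PMI, CF, DD); returns kept pairs, best first."""
    items = list(stats.items())
    top = int(lam[0] * len(items))
    keep = {k for k, _ in sorted(items, key=lambda kv: -kv[1][0])[:top]}   # top-80% PMI
    top = int(lam[1] * len(items))
    keep &= {k for k, _ in sorted(items, key=lambda kv: -kv[1][1])[:top]}  # top-80% CF
    scored = []
    for key, (p, cf, dd) in items:
        if key not in keep or dd >= lam[2]:        # lambda_3: drop long dependencies
            continue
        v = theta[0] * p + theta[1] * cf + theta[2] * (1.0 / dd)  # V_collocation
        scored.append((v, key))
    return [key for _, key in sorted(scored, reverse=True)]
```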
The vocabulary is classified into nouns, verbs, adjectives, adverbs and so on. The words within each part of speech are then divided into major classes and subclasses by semantics, combining bottom-up and top-down methods. Words are grouped into subclasses bottom-up: in the initial state each word is its own subclass, and two subclasses are merged into one group when their words are synonyms, near-synonyms, antonyms or similar words; this respects semantic relatedness while allowing similar collocational behavior, i.e. a subclass may be a set of synonyms, near-synonyms, antonyms or similar words, or a combination of these. All groups are traversed until no new merges occur. Notably, some groups could merge with several others; the polysemous word "apple", for example, could join "pear" in a "fruit" group or "Hewlett-Packard" in a "computer" group. However the process proceeds, repetition reduces the number of groups and enlarges the vocabulary of each group. Finally, a name may be defined for each group or subclass, such as "fruit" among the nouns or "agree or disagree" among the verbs.
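The bottom-up grouping can be sketched with union-find bookkeeping, assuming a precomputed test of whether two words are synonyms, near-synonyms, antonyms or similar words; the toy relation at the end stands in for a real lexical resource.

```python
# A minimal sketch of bottom-up subclass merging via union-find.

class UnionFind:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def build_subclasses(words, related):
    """Merge words into subclasses; related(a, b) is the assumed semantic test."""
    uf = UnionFind(words)
    for i, a in enumerate(words):       # traverse all pairs; no new merges remain after
        for b in words[i + 1:]:
            if related(a, b):
                uf.union(a, b)
    groups = {}
    for w in words:
        groups.setdefault(uf.find(w), []).append(w)
    return list(groups.values())

# Toy relation standing in for synonym/near-synonym/antonym/similar-word lookup:
pairs = {("apple", "pear"), ("big", "large"), ("big", "small")}  # antonyms merge too
related = lambda a, b: (a, b) in pairs or (b, a) in pairs
print(build_subclasses(["apple", "pear", "big", "large", "small"], related))
# [['apple', 'pear'], ['big', 'large', 'small']]
```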
Then the semantic major classes are induced top-down and the subclasses are assigned beneath them. Specifically, abstract semantic major classes are drawn up for each part of speech: verbs, for example, may be divided into mental verbs, behavioral verbs, body verbs, property changes, object relations, social activities and so on; adverbs into adverbs of time, place, degree, frequency and manner; adjectives into state and situation, talent, mental activity, time and space, quantity and degree, science and culture, social life and so on. The subclasses obtained in the previous step are then assigned under the major classes; if the major classes cannot fully cover the subclasses, major classes are added or modified, until finally all subclasses are reasonably assigned into the classification system. The assigned subclasses are the word classifications.
If more classification levels are needed, such as a middle class between the major and minor classes, the major classes can be subdivided top-down into middle classes with the subclasses assigned beneath them, or the subclasses can be merged bottom-up into middle classes which are then grouped into major classes, or the two approaches can be combined.
Each word classification then obtains a series of one-to-many collocations from the word collocations:

W1: [C1, C2, …, Cn], W2: [C1, C2, …, Cn], …, Wm: [C1, C2, …, Cn]

Then the word-classification vocabulary and the corresponding word collocations are gathered into a many-to-many collocation matrix:

[W1, W2, …, Wm] × [C1, C2, …, Cn]^T = [W1: C1, C2, …, Cn] ∩ [W2: C1, C2, …, Cn] ∩ … ∩ [Wm: C1, C2, …, Cn];
sorting the collocation words in each collocation matrix according to the collocation density;
sorting the word classification in each collocation matrix according to the collocation density;
The word classifications are split or recombined according to the collocation densities of the collocation matrix. Specifically, if, after the classification's words are sorted, the collocation density of some entries in the collocation matrix falls below a threshold β1, those entries are split into a separate collocation matrix; if some words in the collocation matrix share similar collocations but have fewer than a threshold β2 collocations with the other words, those words are split into a separate collocation matrix; if two collocation matrices are semantically related or similar subclasses, their collocation words are similar, and the collocation density of the merged matrix reaches a threshold β3, the two word classifications are merged. Here β1 may lie between 0.1 and 0.2, β2 between 1 and 5, and β3 between 0.3 and 1.0.
The collocation density of a word classification is the actual number of links between the word classification and its collocation words divided by the full number of links; the collocation density of a collocation word is the actual number of links between the collocation word and the word classification divided by the full number of links; the collocation density of a collocation matrix is the actual number of links in the matrix divided by the full number of links in the matrix. These steps are repeated until the word classifications have no new splits or recombinations.
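A minimal sketch of the density-driven split and merge decisions follows; it shows the β1 split and β3 merge tests (the β2 test on shared collocations is analogous), with thresholds taken from the ranges stated above and the subclass-similarity check left as a placeholder.

```python
# A minimal sketch of splitting and recombining word classifications by density.

BETA1, BETA3 = 0.15, 0.5   # illustrative values within the stated ranges

def density(words, collocates, links):
    full = len(words) * len(collocates)
    return sum((w, c) in links for w in words for c in collocates) / full if full else 0.0

def split_sparse_words(words, collocates, links):
    """beta_1: move words whose own link density is too low into a new matrix."""
    dense, sparse = [], []
    for w in words:
        row = sum((w, c) in links for c in collocates) / len(collocates)
        (dense if row >= BETA1 else sparse).append(w)
    return dense, sparse

def try_merge(words_a, words_b, collocates_a, collocates_b, links):
    """beta_3: merge two related classifications if the combined density is high enough."""
    words = words_a + words_b
    collocates = sorted(set(collocates_a) | set(collocates_b))
    if density(words, collocates, links) >= BETA3:
        return words, collocates
    return None
```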
The final many-to-many word-classification collocation matrix is displayed in card form as m × n or m × n × o; the cards are displayed in the hierarchical order of major, middle and minor classes of the lexical semantic classification system. The whole vocabulary classification system can be presented through one or more of cards, books, electronic cards, electronic books and software. Generally, a whole vocabulary classification system for learning purposes is presented as a card book, i.e. a book composed of such cards, suitable for editing into paper books, electronic books and language-learning software.
For card display, the final word-classification collocation matrix is laid out on cards in an m × n or m × n × o arrangement. Note that the word-classification collocation matrices from the previous step come in two kinds: combination mappings of word classification (subclass) + collocation words, and combination mappings of collocation words + word classification (subclass). Moreover, the number of collocation words is often far larger than the word classification (subclass); a classification of 10 words may have over 100 collocates. Since the collocation words and word classifications have already been sorted by collocation density in the previous steps, only the top-ranked collocations need be kept for a card. For example, the earlier example could be laid out as a 10 × 12 card, i.e. an m × n word-classification (synonym/near-synonym) collocation card. The two kinds of m × n cards (word classification + collocation words, collocation words + word classification) can also be merged into an m × n × o card (collocation words + word classification + collocation words). Note that this merging is a simple juxtaposition aimed at denser information: the first and third columns are not directly related, but in practice this layout often characterizes the collocations before and after the word classes in the middle column.
The m × n display modes are of two kinds: the word classification (subclass) in the first column with collocation words in the following column (see FIG. 4), or the word classification in the second column with collocation words in the preceding column (see FIG. 5); the three-column m × n × o display places the word classes in the middle column with collocation words in the front and back columns (see FIG. 3).
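For illustration, a three-column card of the kind shown in FIG. 3 can be rendered as plain text as follows; the words and collocates are invented examples, not output of the method.

```python
# A minimal sketch of a three-column (m x n x o) card layout as plain text.

def render_card(front, words, back, rows=5, width=14):
    """Print rows of: front collocates | classification words | back collocates."""
    for i in range(rows):
        f = front[i] if i < len(front) else ""
        w = words[i] if i < len(words) else ""
        b = back[i] if i < len(back) else ""
        print(f"{f:<{width}}| {w:<{width}}| {b:<{width}}")

render_card(
    front=["fresh", "ripe", "sweet"],   # adjectives modifying the nouns
    words=["apple", "pear", "peach"],   # a 'fruit' noun subclass (middle column)
    back=["ripens", "falls", "rots"],   # verbs in subject-predicate relation
    rows=3,
)
```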
For different languages, the word order of a collocation may differ. For example, Chinese and English are mostly adjective + noun and mostly adverb + verb, with a small proportion of verb + adverb; French is mostly noun + adjective, with few adjective + noun. All of these, however, can be covered by the three display modes above.
The many-to-many word-classification collocation matrix (mapping) cards and card books are intuitive and vivid and have high information density: for example, a card book of 5 × 5 cards with an average collocation density of 0.5 carries (5 × 5)/2 = 12.5 actual links per card, so compared with the one-example-sentence-per-line or usage-note format of a traditional book, the information density is (5 × 5)/2/5 = 2.5 times higher. The semantically grouped linking mode, and the learning and memory cards and test and evaluation cards derived from it, all suit the characteristics of vocabulary-learning books and software based on this method.
The collocation extraction and semantic classification method disclosed in this application is applicable to many languages: because it does not concern the logic of whole articles or sentences but only judges words, it can be applied to languages with different grammatical structures, such as Chinese, French, English, Japanese and Korean. For less common languages, the method can be used once the grammar rules are known.

Claims (10)

1. A mining method combining vocabulary collocation extraction and semantic classification is characterized by comprising the following steps:
step A: vocabulary collocation extraction;
and B: distinguishing the vocabulary semantic classification of parts of speech;
and C: computing collocations against the word classifications to form a many-to-many collocation mapping; sorting the collocation words and word classifications by collocation density, and then splitting and recombining the word classifications; the steps are repeated until the word classifications have no new splits or recombinations.
2. The method of claim 1, wherein the step a further comprises the following steps:
step A1: marking the raw corpus by a syntactic and semantic dependency model based on deep learning, outputting a syntactic dependency tree or a semantic dependency graph, and extracting word collocation from the syntactic dependency tree or the semantic dependency graph;
step A2: based on the results of the syntactic dependency tree or semantic dependency graph, filtering unqualified word collocations according to the mutual information, dependency distance and frequency of each word collocation.
3. The method as claimed in claim 2, wherein before step a1, the syntactic-semantic dependency model is trained on the labeled corpus, wherein the syntactic dependency tree or semantic dependency graph model can be selected during training, and one or more syntactic dependency tree or semantic dependency graph models can be trained simultaneously.
4. The method as claimed in claim 2, wherein in step A2, the frequency CF of each word collocation and the average dependency distance DD of each word collocation are counted, and the pointwise mutual information PMI of each word collocation is calculated:

PMI(w1, w2) = log( P(w1, w2) / (P(w1) · P(w2)) )

and a screening threshold V_collocation is set:

V_collocation = θ1 · PMI + θ2 · CF + θ3 · (1 / DD)

wherein the parameter criteria of the screening threshold V_collocation are: λ1 ≤ 80, keeping only word collocations whose frequency ranks in the top 80%; λ2 ≤ 80, keeping only word collocations whose mutual information ranks in the top 80%; λ3 ≥ 10, discarding word collocations whose dependency distance is 10 or greater; θ1, θ2 and θ3 are set to appropriate values according to the value ranges of the specific PMI, CF and DD;

word collocations that do not meet the parameter criteria of the screening threshold V_collocation are discarded, and those that meet them are then filtered according to V_collocation.
5. The method of claim 1, wherein the step B further comprises the following steps:
b1: classifying the vocabulary according to the part of speech to obtain part of speech classification;
b2: dividing the vocabulary within each part-of-speech classification into subclasses by semantic relevance, using a bottom-up method: in the initial state each word is an independent subclass, two subclasses are merged into a new subclass when their words are synonyms, near-synonyms, antonyms or otherwise similar words, all subclasses are traversed until no new merges occur, and subclass names are defined where necessary;
b3: grouping the vocabulary within each part-of-speech classification into major classes by semantic relevance, using a top-down method: abstract semantic major classes are drawn up for each part of speech, the subclasses are then assigned under the major classes, major classes are added or modified if they cannot fully cover the subclasses, and finally all subclasses are assigned into major classes; the assigned subclasses are the word classifications.
6. The method of claim 1, wherein the step C further comprises the following steps:
c1: each word classification obtains a series of one-to-many collocations from the word collocations:

W1: [C1, C2, …, Cn], W2: [C1, C2, …, Cn], …, Wm: [C1, C2, …, Cn]

c2: obtaining a many-to-many collocation matrix by combining the word-classification words with their corresponding word collocations:

[W1, W2, …, Wm] × [C1, C2, …, Cn]^T = [W1: C1, C2, …, Cn] ∩ [W2: C1, C2, …, Cn] ∩ … ∩ [Wm: C1, C2, …, Cn];
c3: sorting the collocation words in each collocation matrix according to the collocation density thereof, and sorting the word classification in each collocation matrix according to the collocation density thereof;
c4: splitting or recombining the word classification according to the collocation density of the collocation matrix;
the collocation density of a word classification is the actual number of links between the word classification and its collocation words divided by the full number of links; the collocation density of a collocation word is the actual number of links between the collocation word and the word classification divided by the full number of links; the collocation density of a collocation matrix is the actual number of links in the collocation matrix divided by the full number of links in the collocation matrix;
c5: the C1, C2, C3, and C4 steps are repeated until the word class has no new splits and recombinations.
7. The method of claim 1, wherein in step C4, the method further comprises the following steps:
c4-1: if, after the classification's words are sorted, the collocation density of some entries in the collocation matrix is below a threshold β1, splitting those entries into a separate collocation matrix;

c4-2: if some words in the collocation matrix share similar collocations but have fewer than a threshold β2 collocations with the other words, splitting those words into a separate collocation matrix;

c4-3: if two collocation matrices are semantically related or similar subclasses, their collocation words are similar, and the collocation density of the merged matrix reaches a threshold β3, merging the two word classifications.
8. A display method combining word collocation extraction and semantic classification, characterized in that the collocation and word-classification mapping obtained by the word collocation extraction and semantic classification method of any one of claims 1 to 7 is displayed in a card structure of m × n or m × n × o, where n and o are collocation words and m is the words within a category; the cards are displayed in the hierarchical order of the lexical semantic classification system.
9. The method as claimed in claim 8, wherein the display mode is one or more of card display, book display, electronic card display, electronic book display, and electronic software display.
10. The method as claimed in claim 9, wherein if the mapping to be displayed exceeds the displayable range, the mapping is truncated according to the ordering of the collocation words.
CN202110162745.8A (filed 2021-02-05, priority 2021-02-05) CN112860781A (en): Mining and displaying method combining vocabulary collocation extraction and semantic classification. Status: Pending.

Priority Applications (1)

CN202110162745.8A (priority date 2021-02-05, filing date 2021-02-05): Mining and displaying method combining vocabulary collocation extraction and semantic classification


Publications (1)

CN112860781A: published 2021-05-28

Family

ID=75988613

Family Applications (1)

CN202110162745.8A (priority date 2021-02-05, filing date 2021-02-05): Mining and displaying method combining vocabulary collocation extraction and semantic classification. Status: Pending.

Country Status (1)

CN: CN112860781A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268160A * (priority 2014-09-05, published 2015-01-07, Beijing Institute of Technology): Evaluation object extraction method based on domain dictionary and semantic roles
US20170132205A1 * (priority 2015-11-05, published 2017-05-11, Abbyy Infopoisk Llc): Identifying word collocations in natural language texts
CN106777275A * (priority 2016-12-29, published 2017-05-31, Beijing Institute of Technology): Extraction Method of Entity Attributes and Attribute Values Based on Multi-granularity Semantic Blocks
CN109299455A * (priority 2017-12-20, published 2019-02-01, Beijing Union University): A computer language processing method for Chinese gerunds with unusual collocations
CN109086269A * (priority 2018-07-19, published 2018-12-25, Dalian University of Technology): A kind of equivocacy language recognition methods indicated based on semantic resources word with Matching Relation

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Yu Shiwen, Huang Chu-Ren: Prospects of Computational Linguistics (计算语言学前瞻). The Commercial Press, 31 August 2005, pages 23-28.
努力的番茄 (blog author): Machine Learning Implemented in Python (机器学习python实现), page 3. Retrieved from the Internet: 博客园 (cnblogs).
Zhou Zhihua: Machine Learning (机器学习). Tsinghua University Press, 31 January 2016, page 130.
Zhang Jidong: A Corpus-Based Study of English Language Features (基于语料库的英语语言特征研究). Shanghai Jiao Tong University Press, 31 May 2012, pages 169-174.
Li Bin: Semantic Analysis and Computation of Verb-Object Collocations (动宾搭配的语义分析和计算). World Book Publishing Company, 30 November 2011, pages 21-26.

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113901791A * (priority 2021-09-15, published 2022-01-07, Kunming University of Science and Technology): A Dependency Syntax Parsing Method for Fusion Multi-Strategy Data Augmentation under Low Resource Conditions
CN113901791B * (priority 2021-09-15, published 2022-09-23, Kunming University of Science and Technology): A Dependency Syntax Parsing Method for Fusion Multi-Strategy Data Augmentation under Low Resource Conditions
CN113742459A * (priority 2021-11-05, published 2021-12-03, 北京世纪好未来教育科技有限公司): Vocabulary display method and device, electronic equipment and storage medium
CN118013017A * (priority 2024-03-12, published 2024-05-10, 云观视角数字科技(陕西)有限公司): Intelligent text automatic generation method based on AI large language model
CN118013017B * (priority 2024-03-12, published 2024-07-05, 云观视角数字科技(陕西)有限公司): Intelligent text automatic generation method based on AI large language model


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination