CN107679174A - Construction method, device and server for a Knowledge Organization System
- Publication number
- CN107679174A (application CN201710909279.9A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention provides a construction method, device and server for a Knowledge Organization System. The method includes: obtaining multiple original documents related to a target domain; clustering the original documents to generate a first cluster set; screening each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain; and updating the preset word list and word relation list according to the second cluster set. Compared with prior-art construction methods, the embodiments of the invention do not need a large amount of manpower to classify and screen the large number of original documents in the target domain, or to update the preset word list and word relation list of the target domain. This saves resources, improves the efficiency of building the Knowledge Organization System, and reduces errors caused by human factors, so that the resulting Knowledge Organization System is more accurate.
Description
Technical field
The present invention relates to the field of computer technology, and specifically to a construction method, device and server for a Knowledge Organization System.
Background
A Knowledge Organization System (KOS) is a system that can be understood, read and recognized by a computer. It mainly comprises various vocabularies, such as descriptive words and subject terms, and is the infrastructure of current retrieval systems.
In the prior art, to make the resulting Knowledge Organization System more accurate, most of the construction work requires manual participation. For example, when a large number of words (keywords, subject terms, etc.) have been collected for a field, people (for example, experts in that field) must classify and associate these words. Moreover, as information keeps expanding, considerable manpower is needed to update the Knowledge Organization System from time to time, for example to update its various vocabularies; this work also relies mainly on manual review and screening to add suitable words to the corresponding vocabulary.
It follows from the existing methods described above that building a Knowledge Organization System requires a great deal of manpower to update and maintain it. In practice, therefore, to save resources, the responsible departments usually do not set a short cycle for updating and maintaining the system, which inconveniences its users; for example, a user may be unable to retrieve relevant material in time. In addition, because the workload of such a mainly manual method is large, human error is hard to avoid, which makes the resulting Knowledge Organization System inaccurate.
Summary of the invention
In view of the above, the invention provides a construction method, device and server for a Knowledge Organization System that reduce the user's workload during construction. Compared with prior-art methods of building a Knowledge Organization System, the construction method provided by the invention is more automated and improves construction efficiency.
An embodiment of the invention provides a construction method for a Knowledge Organization System, including:
Obtaining multiple original documents related to a target domain;
Clustering the multiple original documents to generate a first cluster set;
Screening each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain;
Updating the preset word list and word relation list according to the second cluster set.
Preferably, the step of screening each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain, includes:
Determining, according to the preset word list and word relation list, the number of times the words they contain occur in each cluster of the first cluster set;
Selecting, according to the determined numbers of occurrences, the clusters associated with the target domain from the first cluster set to generate the second cluster set.
Preferably, the step of updating the preset word list according to the second cluster set includes:
Determining the similarity between the words contained in the preset word list and the words contained in each cluster of the second cluster set;
Updating the preset word list with those words in the second cluster set whose determined similarity is below a predetermined threshold.
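The word-list update just described can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not name a similarity measure, so `difflib.SequenceMatcher` is used as a stand-in, and the function names, the threshold of 0.5 and the sample words are all hypothetical. The sketch reads "similarity below the threshold" as "the word is not yet covered by the preset list".

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; the source does not specify a measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def update_word_list(preset_words, cluster_words, threshold=0.5):
    """Append cluster words whose best similarity to any listed word is below threshold."""
    updated = list(preset_words)
    for word in cluster_words:
        best = max((similarity(word, known) for known in updated), default=0.0)
        if best < threshold:  # assumption: low similarity means a genuinely new word
            updated.append(word)
    return updated


preset = ["K-means", "neural network"]            # hypothetical preset word list
candidates = ["k-means", "DBSCAN", "neural networks"]  # hypothetical cluster words
result = update_word_list(preset, candidates)
print(result)
```

Near-duplicates of existing entries are skipped, and only the dissimilar candidate is appended. A production system would more likely use a semantic similarity than a character-level one.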
Preferably, the step of updating the preset word relation list according to the second cluster set includes:
Extracting target words from each original document in the second cluster set;
Determining the relation vector of each target word according to its context information;
Calculating, according to the determined relation vectors, the similarity of any two target words in the second cluster set;
Determining the word relation of any two target words according to how their similarity relates to predetermined similarity intervals, and updating the preset word relation list according to that word relation.
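One way to picture the relation-vector steps above is the sketch below. It makes several assumptions that the source does not state: the relation vector is a simple co-occurrence count over a context window, the similarity is cosine similarity, and the similarity intervals and their labels in `INTERVALS` are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def relation_vector(target, sentences, window=2):
    """Count the words that appear within `window` tokens of `target`."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for t in tokens[lo:hi] if t != target)
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse count vectors, clamped to at most 1.0."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return min(1.0, dot / norm) if norm else 0.0


# Hypothetical similarity intervals (closed, checked in order); not from the source.
INTERVALS = [(0.8, 1.0, "equivalence"), (0.4, 0.8, "coordinate")]


def classify(sim):
    """Map a similarity to a relation label, or None if it falls in no interval."""
    for lo, hi, label in INTERVALS:
        if lo <= sim <= hi:
            return label
    return None


sentences = [
    ["k-means", "partitions", "data", "into", "clusters"],
    ["kmeans", "partitions", "data", "into", "groups"],
    ["dbscan", "finds", "dense", "regions"],
]
v1 = relation_vector("k-means", sentences)
v2 = relation_vector("kmeans", sentences)
v3 = relation_vector("dbscan", sentences)
relation = classify(cosine(v1, v2))
print(relation, classify(cosine(v1, v3)))
```

Here the two spellings share the same context words and land in the top interval, while the third word shares no context with them and gets no relation.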
Preferably, after the step of generating the second cluster set that includes the clusters associated with the target domain, the method further includes:
Extracting, from each original document in the second cluster set, the words that satisfy a preset part of speech and/or word length;
Deduplicating the words that satisfy the preset part of speech and/or word length, and generating a term set of the target domain from the deduplicated words;
Wherein updating the preset word list according to the second cluster set includes:
Updating the preset word list according to the term set of the target domain.
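The term-set generation above might look like the following sketch. The input format (documents already tokenized into `(word, part_of_speech)` pairs), the tag `"n"` for nouns and the length bounds are assumptions made for illustration. The sketch also applies both the part-of-speech and the length conditions together, which is only one of the combinations the text allows.

```python
def build_term_set(tagged_docs, allowed_pos=frozenset({"n"}), min_len=2, max_len=20):
    """Collect words matching the preset part of speech and length, deduplicated in order."""
    seen = {}  # dict preserves insertion order, so first occurrences come first
    for doc in tagged_docs:
        for word, pos in doc:
            if pos in allowed_pos and min_len <= len(word) <= max_len:
                seen.setdefault(word, None)
    return list(seen)


# Hypothetical pre-tagged documents from the second cluster set.
docs = [
    [("cluster", "n"), ("run", "v"), ("x", "n")],
    [("cluster", "n"), ("centroid", "n")],
]
terms = build_term_set(docs)
print(terms)
```

The verb and the one-letter token are filtered out, and the repeated noun is kept once.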
Preferably, the construction method further includes:
Determining, according to the term set of the target domain, the indexing degree of each original document in the second cluster set;
Determining, according to the determined indexing degree, whether each original document in the second cluster set meets the standard;
Wherein updating the preset word relation list according to the second cluster set includes:
Updating the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.
Preferably, before the step of determining the indexing degree of each original document in the second cluster set according to the term set, the method further includes:
Extracting the target words contained in each original document in the second cluster set;
Wherein the step of determining the indexing degree of each original document in the second cluster set according to the term set includes:
Indexing each original document in the second cluster set according to the words contained in the term set;
Determining the indexing degree of any original document in the second cluster set according to the length of the target words it contains and the length of the words indexed in it.
Preferably, the step of determining the indexing degree of each original document in the second cluster set according to the term set includes:
Determining, for any original document in the second cluster set, the number of its target words and the number of those target words that are contained in the term set;
Determining the indexing degree of that original document according to the number of its target words and the number of those target words that are contained in the term set.
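The two ways of determining the indexing degree described above (by word length and by word count) can be sketched as simple ratios. The source only says the degree is determined "according to" these quantities, so the ratio form, the function names and the sample data are assumptions.

```python
def indexing_degree_by_count(target_words, term_set):
    """Share of a document's target words that are found in the term set."""
    if not target_words:
        return 0.0
    hits = sum(1 for w in target_words if w in term_set)
    return hits / len(target_words)


def indexing_degree_by_length(target_words, term_set):
    """Share of the total target-word length that is covered by indexed words."""
    total = sum(len(w) for w in target_words)
    if total == 0:
        return 0.0
    indexed = sum(len(w) for w in target_words if w in term_set)
    return indexed / total


# Hypothetical target words of one original document and a hypothetical term set.
targets = ["clustering", "of", "documents", "ontology"]
term_set = {"clustering", "ontology"}
by_count = indexing_degree_by_count(targets, term_set)
by_length = indexing_degree_by_length(targets, term_set)
print(by_count, by_length)
```

A document could then be treated as meeting the standard when its degree reaches some preset cut-off; that cut-off is not given in the text.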
An embodiment of the invention also provides a construction device for a Knowledge Organization System, including:
An acquiring unit, a clustering unit, a screening unit and an updating unit, wherein:
The acquiring unit is configured to obtain multiple original documents related to a target domain;
The clustering unit is configured to cluster the multiple original documents to generate a first cluster set;
The screening unit is configured to screen each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain;
The updating unit is configured to update the preset word list and word relation list according to the second cluster set.
Preferably, the screening unit is specifically configured to:
Determine, according to the preset word list and word relation list, the number of times the words they contain occur in each cluster of the first cluster set; and select, according to the determined numbers of occurrences, the clusters associated with the target domain from the first cluster set to generate the second cluster set.
Preferably, the updating unit is specifically configured to:
Determine the similarity between the words contained in the preset word list and the words contained in each cluster of the second cluster set;
Update the preset word list with those words in the second cluster set whose determined similarity is below a predetermined threshold.
Preferably, the updating unit is specifically configured to:
Extract target words from each original document in the second cluster set;
Determine the relation vector of each target word according to its context information;
Calculate, according to the determined relation vectors, the similarity of any two target words in the second cluster set;
Determine the word relation of any two target words according to how their similarity relates to predetermined similarity intervals, and update the preset word relation list according to that word relation.
Preferably, the device further includes a generating unit, which is configured to:
After the second cluster set that includes the clusters associated with the target domain is generated, extract from each original document in the second cluster set the words that satisfy a preset part of speech and/or word length; deduplicate the words that satisfy the preset part of speech and/or word length, and generate a term set of the target domain from the deduplicated words;
Wherein the updating unit is specifically configured to:
Update the preset word list according to the term set of the target domain.
Preferably, the device further includes a determining unit, which is configured to:
Determine, according to the term set of the target domain, the indexing degree of each original document in the second cluster set;
Determine, according to the determined indexing degree, whether each original document in the second cluster set meets the standard;
Wherein the updating unit is specifically configured to:
Update the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.
Preferably, the construction device further includes an extracting unit, which is configured to:
Extract the target words contained in each original document in the second cluster set;
Wherein the determining unit is specifically configured to:
Index each original document in the second cluster set according to the words contained in the term set;
Determine the indexing degree of any original document in the second cluster set according to the length of the target words it contains and the length of the words indexed in it.
Preferably, the determining unit is specifically configured to:
Determine, for any original document in the second cluster set, the number of its target words and the number of those target words that are contained in the term set;
Determine the indexing degree of that original document according to the number of its target words and the number of those target words that are contained in the term set.
An embodiment of the invention also provides a server including a memory and a processor. The memory stores information including program instructions, and the processor controls the execution of the program instructions; when the program is executed by the processor, the steps of any of the Knowledge Organization System construction methods provided by the embodiments of the invention are carried out.
The embodiments of the invention provide the following beneficial effects:
In the embodiments of the invention, after multiple original documents related to the target domain are obtained, they are clustered to obtain multiple clusters of the target domain (the first cluster set). These clusters are then screened according to the preset word list and word relation list corresponding to the target domain, which removes clusters that are unrelated or only weakly related to the target domain and leaves the set of clusters associated with it (the second cluster set). The preset word list and word relation list are then updated according to the second cluster set. With the embodiments of the invention, once the original documents related to the target domain are obtained, they are clustered and screened automatically to obtain the clusters related to the target domain, and the screened clusters are used to update the preset word list and word relation list of the target domain automatically. Compared with prior-art construction methods, the embodiments of the invention do not need a large amount of manpower to classify and screen the large number of original documents in the target domain, or to update the preset word list and word relation list of the target domain. This saves resources, improves the efficiency of building the Knowledge Organization System, and reduces errors caused by human factors, so that the resulting Knowledge Organization System is more accurate.
Additional aspects and advantages of the invention will be set forth in part in the following description; they will become apparent from that description or be learned through practice of the invention.
Brief description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments taken together with the accompanying drawings, in which:
Fig. 1 is a schematic flow chart of a construction method for a Knowledge Organization System provided by Embodiment 1 of the invention;
Fig. 2 is a schematic diagram of updating the preset word list according to the second cluster set, provided by Embodiment 1 of the invention;
Fig. 3 is a schematic diagram of updating the preset word relation list according to the second cluster set in Embodiment 1 of the invention;
Fig. 4 is a schematic flow chart of another construction method for a Knowledge Organization System provided by Embodiment 2 of the invention;
Fig. 5 is a schematic structural diagram of a construction device for a Knowledge Organization System in Embodiment 3 of the invention;
Fig. 6 is a schematic structural diagram of a server in Embodiment 4 of the invention.
Detailed description of the embodiments
Embodiments of the invention are described in detail below, and examples of the embodiments are shown in the drawings, in which the same or similar reference numerals denote the same or similar elements, or elements with the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural. It should be further understood that the word "comprising" used in the specification of the invention indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any unit and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, will not be interpreted in an idealized or overly formal sense.
The technical solutions of the embodiments of the invention are described in detail below with reference to the drawings.
Embodiment 1
An embodiment of the invention provides a construction method for a Knowledge Organization System. A schematic flow chart of the method is shown in Fig. 1, and the method specifically includes the following steps:
S101: Obtain multiple original documents related to the target domain.
S102: Cluster the multiple original documents to generate a first cluster set.
S103: Screen each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain.
S104: Update the preset word list and word relation list according to the second cluster set.
With the embodiment of the invention, once the original documents related to the target domain are obtained, they are clustered and screened automatically to obtain the clusters related to the target domain, and the screened clusters are used to update the preset word list and word relation list of the target domain automatically. Compared with prior-art construction methods, the embodiment of the invention does not need a large amount of manpower to classify and screen the large number of original documents in the target domain, or to update the preset word list and word relation list of the target domain. This saves resources, improves the efficiency of building the Knowledge Organization System, and reduces errors caused by human factors, so that the resulting Knowledge Organization System is more accurate.
The specific implementation of each of the above steps is described further below:
S101: Obtain multiple original documents related to the target domain.
In this step, multiple original documents related to the target domain are obtained, and there are many ways to do so. For example, suppose the original documents to be obtained are the literature collection corresponding to the target domain. In one embodiment, the collection can be obtained from a standard literature database: according to a preset search strategy, the literature related to the target domain is retrieved from the database. The preset search strategy may specifically include preset search terms for the target domain; for example, if the target domain is "computer", the corresponding search terms may be "computer", "server", and so on. After the literature collection corresponding to the target domain is obtained, metadata including the title, keywords, abstract and first paragraph of the body text is extracted from each document in the collection; according to the extracted metadata, a target literature collection that better matches the user's needs is selected from the collection.
In another embodiment, a large amount of target literature can be crawled from the literature databases of the relevant websites using crawler technology. These are merely two examples of how to obtain the original documents of the target domain; in practice there are many other ways, and the embodiments of the invention place no specific limitation on this.
S102: Cluster the multiple original documents to generate a first cluster set.
In this step, the obtained original documents of the target domain are clustered to generate the first cluster set. The ways of clustering the original documents include K-means (K-means clustering), DBSCAN (density-based clustering), BIRCH (hierarchical clustering), and so on. The clustering method can be chosen according to the nature of the obtained original documents; for example, when the user has specified the number of classes into which the original documents are to be divided, the K-means clustering method can be selected.
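To make the clustering step concrete, here is a minimal pure-Python K-means sketch over bag-of-words vectors, for the case where the number of classes is known in advance. The vectorization, the deterministic initialization (the first k documents as starting centroids) and the sample documents are assumptions made so that the example is small and reproducible; a real system would more likely use a library implementation with a better initialization.

```python
from collections import Counter


def vectorize(docs):
    """Turn whitespace-tokenized documents into bag-of-words count vectors."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [Counter(d.split()) for d in docs]
    return [[c[w] for w in vocab] for c in counts]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Plain K-means; centroids start at the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical original documents.
docs = [
    "neural network training",
    "stock market prices",
    "neural network layers",
    "market prices fall",
]
labels = kmeans(vectorize(docs), k=2)
print(labels)
```

Documents that share vocabulary end up with the same label, and the resulting groups play the role of the first cluster set.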
S103: Screen each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain.
In this step, each cluster in the obtained first cluster set is screened to generate the second cluster set. Specifically, each cluster in the first cluster set is screened according to the preset word list and word relation list corresponding to the target domain, and clusters that are unassociated or only weakly associated with the target domain are removed, which yields the second cluster set. That is, the second cluster set contains the clusters that are associated, or relatively strongly associated, with the target domain.
The preset word list contains the individual words that belong to the target domain. Table 1 shows the preset word list of the "machine learning" domain, in which words such as "neural network", "deep learning", "machine translation" and "K-means" belong to the "machine learning" domain. Table 2 shows the preset word relation list of the "machine learning" domain. The preset relation categories in this list include the equivalence relation, the hypernym-hyponym relation, the coordinate relation, and so on; for example, "K-means" and "K-means clustering" are in an equivalence relation, and "neural network" and "deep learning" are in a hypernym-hyponym relation.
Table 1
In practice, the number of words contained in the word list of a target domain is limited, and the word relation list can supplement the words in the word list. For example, in the word relation list of Table 2, "K-means" and "K-means clustering" are in an equivalence relation. Suppose the word list contains only "K-means". When each cluster in the first cluster set is screened, if "K-means clustering" appears in the cluster being screened, the word relation list shows that "K-means clustering" is equivalent to "K-means" and therefore also belongs to the "machine learning" domain.
Table 2
Relation category | Word 1 | Word 2 |
Equivalence relation | K-means | K-means clustering |
Hypernym-hyponym relation | Neural network | Deep learning |
Coordinate relation | K-means | DBSCAN |
…… | …… | …… |
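How the word relation list can supplement the word list during screening can be sketched as below. The tuple layout of `RELATIONS`, the relation labels and the function name are hypothetical; only the equivalence ("identity") relation from Table 2 is used for expansion, since that is the case the text walks through.

```python
# Hypothetical in-memory form of the word relation list in Table 2.
RELATIONS = [
    ("equivalence", "K-means", "K-means clustering"),
    ("hypernym-hyponym", "neural network", "deep learning"),
    ("coordinate", "K-means", "DBSCAN"),
]


def expand_with_equivalents(word_list, relations):
    """Add every word that is equivalent to a word already in the list."""
    expanded = set(word_list)
    changed = True
    while changed:  # repeat so that chains of equivalences are followed
        changed = False
        for kind, a, b in relations:
            if kind != "equivalence":
                continue
            if (a in expanded) != (b in expanded):
                expanded |= {a, b}
                changed = True
    return expanded


expanded = expand_with_equivalents(["K-means"], RELATIONS)
print(sorted(expanded))
```

With only "K-means" in the word list, the equivalent form is added, so a cluster that mentions either spelling counts toward the target domain. Whether the other relation categories should also expand the list is not stated in the text and is left out here.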
The method of screening each cluster in the first cluster set is specifically as follows: according to the preset word list and word relation list, determine the number of times the words they contain occur in each cluster of the first cluster set; then, according to the determined numbers of occurrences, select the clusters associated with the target domain from the first cluster set to generate the second cluster set.
As shown in Table 3, "machine learning", "neural network", "deep learning" and "K-means" are words contained in the preset word list and word relation list of the "machine learning" domain. Suppose that in cluster A of the first cluster set "machine learning" occurs 500 times, "neural network" 180 times, "deep learning" 50 times and "K-means" 20 times, so these words occur 750 times in total. In cluster B of the first cluster set "machine learning" occurs 20 times, "neural network" 20 times, "deep learning" 8 times and "K-means" 15 times, so these words occur 63 times in total. Comparing the numbers of occurrences in cluster A and cluster B shows that cluster A is more closely associated with the target domain.
In a preferred embodiment, a preset number of occurrences can be set. For Table 3, if the preset number is 500, then the words contained in the preset word list and word relation list occur 750 times in cluster A (more than 500), so cluster A is determined to be associated, or strongly associated, with the target domain; the same words occur 63 times in cluster B (fewer than 500), so cluster B is determined to be unassociated, or only weakly associated, with the target domain.
Table 3
Word | Cluster A (occurrences) | Cluster B (occurrences) |
Machine learning | 500 | 20 |
Neural network | 180 | 20 |
Deep learning | 50 | 8 |
K-means | 20 | 15 |
Total (occurrences) | 750 | 63 |
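The count-based screening of Table 3 can be expressed in a few lines. The data structure and function name are hypothetical; the counts and the preset number of 500 come from the example above. The text only covers totals above and below the preset number, so the strict comparison used for the boundary case is an assumption.

```python
# Occurrence counts from Table 3, keyed by cluster name.
COUNTS = {
    "A": {"machine learning": 500, "neural network": 180, "deep learning": 50, "K-means": 20},
    "B": {"machine learning": 20, "neural network": 20, "deep learning": 8, "K-means": 15},
}


def screen_by_count(cluster_counts, preset_times=500):
    """Keep clusters whose total domain-word occurrences exceed the preset number."""
    return [
        name
        for name, counts in cluster_counts.items()
        if sum(counts.values()) > preset_times  # assumption: strictly greater
    ]


totals = {name: sum(counts.values()) for name, counts in COUNTS.items()}
kept = screen_by_count(COUNTS)
print(totals, kept)
```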
Directly comparing the number of times the words in the preset word list and the word relation list occur in a class cluster with the preset count, as above, is a relatively simple screening method given by this embodiment of the present invention. In practice there are many other screening methods. As shown in Table 4, the words in the preset word list and the word relation list of the "machine learning" domain occur 750 times in class cluster A and 538 times in class cluster B. If the preset count is still 500, the screening method above determines that both class cluster A and class cluster B are associated with the "machine learning" domain. However, "K-means" alone occurs 500 times in class cluster B and accounts for a large share of the total, whereas "machine learning", "neural network" and the other words account for only a small share. Class cluster B is therefore probably strongly associated with a "K-means" domain and only weakly associated with the "machine learning" domain.
For the situation in Table 4, this embodiment of the present invention provides a preferred screening approach: a coefficient is set for each word included in the preset word list and the word relation list, the coefficient representing how much that word contributes to determining whether a class cluster is associated with the target domain. Then, according to the number of times the words in the preset word list and the word relation list occur in each class cluster of the first class cluster set, and the preset coefficient of each word, the class clusters associated with the target domain are screened out of the first class cluster set to generate the second class cluster set. For Table 4, suppose the coefficient of "machine learning" is set to 1, that of "neural network" to 0.6, that of "deep learning" to 0.5, and that of "K-means" to 0.1. The number of occurrences in class cluster A is then converted to 500*1 + 180*0.6 + 50*0.5 + 20*0.1 = 635, and the number of occurrences in class cluster B is converted to 10*1 + 20*0.6 + 8*0.5 + 500*0.1 = 76. If the preset count is still 500, class cluster A is still determined to be associated, or strongly associated, with the target domain, while class cluster B is determined to be unassociated, or only weakly associated, with it.
Table 4
Word | Class cluster A (occurrences) | Class cluster B (occurrences) |
Machine learning | 500 | 10 |
Neural network | 180 | 20 |
Deep learning | 50 | 8 |
K-means | 20 | 500 |
Total (occurrences) | 750 | 538 |
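The coefficient-weighted screening applied to Table 4 can be sketched as below. The counts, the coefficients and the preset count of 500 are taken from the text; the function names `weighted_count` and `screen_by_weighted_count` and the data layout are assumptions made for illustration.

```python
# Occurrence counts from Table 4: class cluster -> {word: occurrences}.
TABLE_4 = {
    "A": {"machine learning": 500, "neural network": 180,
          "deep learning": 50, "K-means": 20},
    "B": {"machine learning": 10, "neural network": 20,
          "deep learning": 8, "K-means": 500},
}

# Per-word coefficients as given in the example.
COEFFICIENTS = {"machine learning": 1.0, "neural network": 0.6,
                "deep learning": 0.5, "K-means": 0.1}


def weighted_count(counts, coefficients):
    """Sum of occurrences, each multiplied by the coefficient of its word."""
    return sum(n * coefficients[word] for word, n in counts.items())


def screen_by_weighted_count(cluster_counts, coefficients, preset_count):
    """Keep the class clusters whose converted count exceeds preset_count."""
    return [name for name, counts in cluster_counts.items()
            if weighted_count(counts, coefficients) > preset_count]


scores = {name: weighted_count(counts, COEFFICIENTS)
          for name, counts in TABLE_4.items()}
selected = screen_by_weighted_count(TABLE_4, COEFFICIENTS, preset_count=500)
```

Unlike the raw totals (750 and 538), the converted counts separate the two class clusters: class cluster B falls well below the preset count because most of its occurrences come from the low-coefficient word "K-means".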
In practice, the coefficients set for the words included in the preset word list and the word relation list may be determined by analyzing the screened corpus in advance, or by other methods of determining coefficients; this embodiment of the present invention places no specific limitation on this.
In practice, each class cluster may contain many (for example, thousands or tens of thousands of) words from the preset word list and the word relation list. In a preferred embodiment, each class cluster may be represented with a vector space model (VSM); according to the weight of each feature in the vector corresponding to each class cluster, the class clusters associated with the target domain are screened out of the first class cluster set to generate the second class cluster set. Here each weight represents how much the corresponding word contributes to determining whether the class cluster is associated with the target domain. For example, class cluster C is represented as the space vector $k_1a_1 + k_2a_2 + k_3a_3 + \dots + k_na_n$, where $a_1, a_2, a_3, \dots, a_n$ denote the words of the preset word list and the word relation list contained in class cluster C (that is, the features), and $k_1, k_2, k_3, \dots, k_n$ are the weights corresponding to $a_1, a_2, a_3, \dots, a_n$ respectively.
Preferably, the weights of the features in the vector corresponding to each class cluster are summed to obtain the degree of association between that class cluster and the target domain. For example, the vector corresponding to class cluster D is $a_1 + 2a_2 + 3a_3 + 6a_5$, and summing its weights gives 12; the vector corresponding to class cluster E is $2a_1 + 5a_2 + 3a_3 + a_4 + a_6$, and summing its weights gives 12. Specifically, a preset threshold may also be set, and each class cluster in the first class cluster set is screened according to the obtained sum and the preset threshold.
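The weight summation can be sketched as follows, using the vector of class cluster D from the example. The text does not say how the sum is compared with the preset threshold, so keeping class clusters whose sum is at least the threshold is an assumption, as are the function names, the second cluster "X" and the threshold value 10.

```python
def association_degree(weights):
    """Sum the feature weights of one class-cluster vector."""
    return sum(weights.values())


def screen_by_degree(cluster_vectors, threshold):
    """Keep class clusters whose association degree reaches the threshold.

    Assumption: 'reaches' means >=; the source only says the sum is
    compared with a preset threshold.
    """
    return [name for name, weights in cluster_vectors.items()
            if association_degree(weights) >= threshold]


# Class cluster D from the example: a1 + 2*a2 + 3*a3 + 6*a5.
cluster_d = {"a1": 1, "a2": 2, "a3": 3, "a5": 6}
degree_d = association_degree(cluster_d)

# Hypothetical second cluster and threshold, for illustration only.
kept = screen_by_degree({"D": cluster_d, "X": {"a1": 1, "a4": 2}}, threshold=10)
```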
The ways of screening class clusters listed above are illustrative only; this embodiment of the present invention places no specific limitation on them.
S104: Update the preset word list and the word relation list according to the second class cluster set.
In this step, the preset word list and the word relation list are updated according to the screened second class cluster set. The method of updating the preset word list specifically includes: determining the similarity between the words included in the preset word list and the words included in each class cluster of the second class cluster set; and updating the preset word list with the words of the second class cluster set whose determined similarity is less than a predetermined threshold.
Specifically, the words in the preset word list and the words included in each class cluster of the second class cluster set are represented as vectors using a vector space model trained in advance. Based on these vectors, the similarity between any word in the preset word list and any word in the second class cluster set is determined, and the words whose similarity is less than the predetermined threshold are used to update the preset word list. There are many ways to calculate the similarity, for example the cosine of the angle between two vectors or the Euclidean distance.
As shown in Fig. 2, the similarity between the words included in the preset word list and the words included in class cluster F is determined. If the calculation shows that the similarities between "coding" and all words in the preset word list, and between "memory" and all words in the preset word list, are below the predetermined threshold, these two words are not yet recorded in the preset word list; "coding" and "memory" are then added to the preset word list, which completes the update of the preset word list.
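A minimal sketch of this update step, using cosine similarity, is given below. The word vectors are small hypothetical values chosen for illustration (in the described method they would come from a vector space model trained in advance), and the function names, the word "host" and the threshold of 0.8 are likewise assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def update_preset_words(preset_words, cluster_words, vectors, threshold):
    """Append cluster words whose similarity to every preset word is below threshold."""
    updated = list(preset_words)
    for word in cluster_words:
        if word in updated:
            continue  # already recorded
        if all(cosine_similarity(vectors[word], vectors[p]) < threshold
               for p in preset_words):
            updated.append(word)
    return updated


# Hypothetical 3-dimensional word vectors.
vectors = {
    "computer": (1.0, 0.0, 0.0),
    "server": (0.9, 0.1, 0.0),
    "host": (0.95, 0.05, 0.0),   # close to "computer", so not added
    "coding": (0.0, 1.0, 0.0),
    "memory": (0.0, 0.0, 1.0),
}
updated_words = update_preset_words(
    ["computer", "server"], ["host", "coding", "memory"], vectors, threshold=0.8)
```

Here "coding" and "memory" are dissimilar to every preset word and are appended, while "host" is nearly parallel to "computer" and is treated as already covered.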
The method of updating the preset word relation list provided by this embodiment of the present invention specifically includes: extracting target words from each original document in the second class cluster set; determining the relation vector of each target word according to its context information; calculating, according to the determined relation vectors, the similarity of any two target words in the second class cluster set; and determining the word relation of any two target words according to which predetermined similarity interval their similarity falls in, and updating the preset word relation list according to that word relation.
Specifically, determining the relation vector of a target word according to its context information includes: extracting a predetermined number of words from the text preceding and following the target word; and determining the relation vector of the target word according to the extracted words and a preset vector space model. As shown in Fig. 3, suppose the target word is "machine learning", and a passage containing the target word extracted from an original document in the second class cluster set reads: "In the field of computer-aided diagnosis, machine learning is widely applied to help medical experts obtain prior knowledge from diagnosed cases". Three words are extracted from the text before "machine learning" and three from the text after it: "computer", "aided", "diagnosis field", "applied", "medical" and "prior knowledge". From these six words, the relation vector of "machine learning" is determined as $b = k_1x + k_2y + k_3z + k_4p + k_5q + k_6r$, where x, y, z, p, q and r denote the six extracted words.
The above extracts only one passage containing the target word from one original document. In practice, the target word may appear in many places in each original document, and the second class cluster set contains many original documents, so the whole second class cluster set may hold a large amount of content containing the target word. Moreover, the words extracted from the context differ from passage to passage. The relation vector corresponding to any target word in the second class cluster set will therefore be a high-dimensional vector (for example, with thousands or tens of thousands of dimensions).
A simple way to determine the weight of each feature in the relation vector of a target word is to use the number of times the feature occurs in the second class cluster set directly as its weight. Continuing the example above, the relation vector of "machine learning" is $b = k_1x + k_2y + k_3z + k_4p + k_5q + k_6r$; if "computer" occurs 6 times in the second class cluster set, then $k_1$ is 6, and if "medical" occurs 2 times in the second class cluster set, then $k_5$ is 2, and so on. This way of determining weights is again only illustrative; many more elaborate ways exist in practice, and the present invention places no specific limitation on this.
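The construction of a relation vector with occurrence counts as weights can be sketched as follows. The sketch assumes each original document is already segmented into a list of content words (segmentation and stop-word removal are outside its scope); the function names and the second toy document are hypothetical.

```python
from collections import Counter


def context_words(tokens, target, window=3):
    """Collect up to `window` tokens before and after each occurrence of target."""
    found = []
    for i, token in enumerate(tokens):
        if token == target:
            found.extend(tokens[max(0, i - window):i])
            found.extend(tokens[i + 1:i + 1 + window])
    return found


def relation_vector(documents, target, window=3):
    """Map each context word of target to its occurrence count in all documents."""
    features = set()
    for tokens in documents:
        features.update(context_words(tokens, target, window))
    corpus_counts = Counter(token for tokens in documents for token in tokens)
    return {feature: corpus_counts[feature] for feature in features}


# First document: the example passage, pre-segmented. Second: hypothetical.
documents = [
    ["computer", "aided", "diagnosis field", "machine learning",
     "applied", "medical", "prior knowledge"],
    ["computer", "medical", "imaging"],
]
vector = relation_vector(documents, "machine learning")
```

The resulting dictionary has one entry per context word; "imaging" is absent because it never occurs within the window around the target word, while "computer" and "medical" receive weight 2 because they occur twice across the two documents.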
After the relation vectors of all target words in the second class cluster set have been determined, the similarity of any two target words is calculated from these relation vectors; the similarity may again be calculated as the cosine of the angle between two vectors, the Euclidean distance, and so on. The word relation of any two target words is then determined from the calculated similarity and the predetermined similarity intervals. For example, the predetermined similarity intervals are: "0.9 to 1" indicates that the two words are in an identity relation, "0.4 to 0.8" indicates that the two words are in a hyponymy (superordinate-subordinate) relation, and "<0.1" indicates that the two words are unrelated or only weakly related. If the similarity of two target words calculated from their relation vectors is 0.97, the relation between the two target words is determined to be an identity relation according to the predetermined similarity intervals, and the preset word relation list is updated according to the determined word relation.
Specifically, if the relation between two target words calculated from the second class cluster set is the same as the relation recorded for them in the preset word relation list, the relation of the two target words in the preset word relation list need not be modified; if it is different, the relation of the two target words in the preset word relation list is supplemented; and if the relation between two target words calculated from the second class cluster set is not recorded in the preset word relation list, the relation of the two target words is added to the preset word relation list.
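The interval lookup and the three update cases can be sketched as below. The interval bounds are the ones given in the example; similarities that fall between the stated intervals are left undetermined here, which is an assumption, as are the function names, the set-based storage of relations and the word pair used in the demonstration.

```python
def classify_relation(similarity):
    """Map a similarity value to a word relation using the example intervals."""
    if 0.9 <= similarity <= 1.0:
        return "identity"
    if 0.4 <= similarity <= 0.8:
        return "hyponymy"
    if similarity < 0.1:
        return "unrelated"
    return None  # assumption: values between the stated intervals are undetermined


def update_relation_list(relation_list, pair, relation):
    """Record `relation` for an unordered word pair.

    Same relation already recorded: nothing changes. Different relation
    recorded: the new one is added alongside it. Pair not recorded: a new
    entry is created.
    """
    if relation is not None:
        relation_list.setdefault(frozenset(pair), set()).add(relation)
    return relation_list


pair = ("machine learning", "statistical learning")  # hypothetical word pair
relations = update_relation_list({}, pair, classify_relation(0.97))
```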
The above method of updating the preset word list and the word relation list according to the second class cluster set is likewise illustrative and shall not be construed as a specific limitation of the present invention.
Embodiment 2
Based on the same inventive concept, this embodiment of the present invention provides another method for constructing a knowledge organization system. A schematic flowchart of the method is shown in Fig. 4, and the method specifically includes the following steps:
S401: Obtain multiple original documents related to the target domain.
S402: Cluster the multiple original documents to generate a first class cluster set.
S403: Screen each class cluster in the first class cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate a second class cluster set that includes the class clusters associated with the target domain.
S404: Extract, from each original document in the second class cluster set, the words that satisfy a preset part of speech and/or word length; deduplicate these words, and generate the term set of the target domain from the deduplicated words.
S405: Determine the index degree of each original document in the second class cluster set according to the term set of the target domain; determine, according to the determined index degree, whether each original document in the second class cluster set is up to standard.
S406: Update the preset word list according to the term set of the target domain, and update the preset word relation list corresponding to the target domain according to the original documents in the second class cluster set that are up to standard.
Steps S401 to S403 are implemented in the same or a similar way as S101 to S103 in Embodiment 1. To avoid repetition, they are not described in detail again here; S404 to S406 are described below.
For S404, the words that satisfy the preset part of speech and/or word length are extracted from each original document in the second class cluster set, where the preset parts of speech specifically include nouns, verbs and the like; these words are deduplicated, and the term set of the target domain is generated from the deduplicated words.
In practice, the words extracted from each original document that satisfy the preset part of speech and/or word length do not include common modal particles, personal pronouns and the like, for example "we" and "what". The generated term set of the target domain contains the words that help identify the target domain. For example, the term set of the "computer" domain includes words such as "computer", "server" and "memory"; as another example, the term set of the "machine learning" domain includes words such as "machine learning", "neural network" and "clustering algorithm".
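Step S404 can be sketched as a filter followed by deduplication. The sketch assumes the documents have already been segmented and part-of-speech tagged into `(word, tag)` pairs by some external tool; the tag names, the minimum length of 2 and the function name `build_term_set` are hypothetical, and requiring both conditions at once is only one reading of "and/or".

```python
def build_term_set(tagged_documents, allowed_tags=("noun", "verb"), min_length=2):
    """Collect the distinct words that have an allowed tag and enough characters."""
    terms = set()  # a set removes duplicates automatically
    for document in tagged_documents:
        for word, tag in document:
            if tag in allowed_tags and len(word) >= min_length:
                terms.add(word)
    return terms


# Hypothetical pre-tagged documents.
tagged_documents = [
    [("we", "pronoun"), ("computer", "noun"), ("server", "noun")],
    [("what", "pronoun"), ("server", "noun"), ("memory", "noun")],
]
term_set = build_term_set(tagged_documents)
```

Pronouns are dropped by the tag filter, and "server", which occurs in both documents, appears once in the result.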
In a preferred embodiment, after the term set of the target domain is obtained, the preset word list is updated directly according to that term set (S406). The specific update method is: determine the similarity between the words included in the term set and the words included in the preset word list; and update the preset word list with the words of the term set whose determined similarity is less than the predetermined threshold. This method of updating the preset word list using the term set is similar to the method of updating the preset word list according to the second class cluster set in Embodiment 1 (S104), and is not described again here.
For S405, the index degree of each original document in the second class cluster set is determined according to the term set of the target domain. The index degree is determined as follows: first extract the target words included in each original document in the second class cluster set; index each original document in the second class cluster set according to the words contained in the term set; and determine the index degree of an original document in the second class cluster set according to the lengths of the target words that the original document includes and the lengths of the words indexed in that original document.
For example, m target words are extracted from a document in total. The document is indexed according to the term set of the domain to which it corresponds, and n words in the document are found to be indexed; the n indexed words include repeated words, so that if "computer" occurs 8 times in the document, n is 8. The index degree of the document is the ratio of the sum of the lengths of the indexed words in the document to the sum of the lengths of the target words in the document. To illustrate this with a specific example, as shown in Table 5, suppose the indexed words of a document are "machine learning", "neural network" and "cluster"; Table 5 records the length of each indexed word and the number of times it occurs in the document. Suppose the document contains 500 target words; the index degree of the document is then obtained by dividing the total length of the indexed words, 10*4 + 5*4 + 3*2 = 66, by the total length of those 500 target words.
Table 5
Indexed word | Occurrences | Length |
Machine learning | 10 | 4 |
Neural network | 5 | 4 |
Cluster | 3 | 2 |
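The length-based index degree described above can be sketched as follows. The sketch assumes that the indexed words are the target words that also appear in the term set, and it measures length in characters; the function name and the toy data are hypothetical and are not the Table 5 document.

```python
def index_degree_by_length(target_words, term_set):
    """Total length of indexed target words divided by total length of all target words.

    `target_words` lists every occurrence, so repeated words count each time.
    """
    total_length = sum(len(word) for word in target_words)
    indexed_length = sum(len(word) for word in target_words if word in term_set)
    return indexed_length / total_length if total_length else 0.0


# Hypothetical document: "cluster" is indexed twice, the other words are not.
target_words = ["cluster", "cluster", "method", "data"]
degree = index_degree_by_length(target_words, term_set={"cluster"})
```

In this toy case the indexed length is 14 characters out of 24, so the index degree is 14/24.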
The index degree may also be determined as follows: determine the number of target words of an original document in the second class cluster set, and the number of those target words that are included in the term set; and determine the index degree of the original document in the second class cluster set from these two numbers.
For example, k target words are extracted from a document in total, of which r words are included in the term set of the domain to which the document corresponds; the r words include repeated words. The index degree of the document is:
Continuing the example of Table 5, the index degree of the document is
After the index degree of each original document in the second class cluster set has been determined, whether each original document in the second class cluster set is up to standard is determined according to that index degree. Specifically, a preset standard value may be set, and the index degree of each original document is compared with the preset standard value to determine whether the document is up to standard.
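The count-based variant and the up-to-standard check can be sketched together. The text says only that the index degree is determined from the two counts and then compared with a preset standard value, so taking the ratio r/k and treating "at least the standard value" as up to standard are both assumptions, as are the function names, the toy documents and the value 0.5.

```python
def index_degree_by_count(target_words, term_set):
    """Assumed ratio r/k: target words found in the term set over all target words."""
    if not target_words:
        return 0.0
    in_term_set = sum(1 for word in target_words if word in term_set)
    return in_term_set / len(target_words)


def split_by_standard(documents, term_set, standard_value):
    """Separate document names into up-to-standard and below-standard lists."""
    up_to_standard, below_standard = [], []
    for name, target_words in documents.items():
        degree = index_degree_by_count(target_words, term_set)
        (up_to_standard if degree >= standard_value else below_standard).append(name)
    return up_to_standard, below_standard


# Hypothetical documents, each given as its list of target-word occurrences.
documents = {
    "doc1": ["cluster", "cluster", "neural network", "data"],
    "doc2": ["recipe", "kitchen", "cluster", "oven"],
}
passed, failed = split_by_standard(
    documents, {"cluster", "neural network"}, standard_value=0.5)
```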
After the determination result is obtained, the preset word relation list corresponding to the target domain is updated according to the original documents in the second class cluster set that are up to standard (S406). This method of updating the preset word relation list is similar to the method of updating the preset word relation list according to the second class cluster set in Embodiment 1 (S104), and is not described again here.
S403 above screens the first class cluster set and removes the document collections that are unassociated or only weakly associated with the target domain, thereby generating the second class cluster set. S405 screens each original document within the second class cluster set, which further increases the degree of association between the second class cluster set and the target domain and in turn improves the accuracy of the updated preset word list and word relation list.
In practice, a single update of the knowledge organization system may need to update several target domains at the same time. In a preferred embodiment, the original documents in the second class cluster set that are not up to standard are clustered again, and the clustered original documents are sent promptly to the databases of the other corresponding target domains, where they are used to update the preset word list or word relation list of those target domains.
In another embodiment, the original documents in the second class cluster set that are not up to standard are passed through S402 to S406 again and reused to update the preset word list and word relation list corresponding to the target domain. When the knowledge organization system is actually implemented, S401 to S406 must be executed repeatedly and the word list and word relation list of the knowledge organization system are updated frequently, so the clustering method, the screening method and the generated term set of the target domain all change from run to run, and the result of whether each original document in the second class cluster set is up to standard may change as well. In this embodiment, therefore, repeating S402 to S406 for the original documents that were not up to standard gives a more accurate judgment of whether those documents are up to standard; the preset word list and the word relation list are updated according to the judgment, which makes full use of the resources.
With this embodiment of the present invention, after the multiple original documents related to the target domain are obtained, they are automatically clustered and screened to obtain the class clusters associated with the target domain; each original document in those class clusters is then screened; and finally the preset word list and word relation list of the target domain are automatically updated according to the screened class clusters. Compared with methods of constructing a knowledge organization system in the prior art, this embodiment does not need a large amount of manpower to classify and screen the very large number of original documents in the target domain or to update the preset word list and word relation list of the target domain, which greatly saves resources and improves the efficiency of constructing the knowledge organization system. Screening the obtained class clusters associated with the target domain more than once ensures that the class cluster set used to update the preset word list and word relation list is strongly associated with the target domain, which improves the accuracy of the updated preset word list and word relation list. In addition, errors caused by human factors are reduced, so the constructed knowledge organization system is more accurate.
Embodiment 3
Based on the same inventive concept, this embodiment of the present invention provides a device for constructing a knowledge organization system. A schematic structural diagram of the device is shown in Fig. 5, and the device specifically includes the following units:
An acquiring unit 501, a clustering unit 502, a screening unit 503 and an updating unit 504, where:
The acquiring unit 501 is configured to obtain multiple original documents related to the target domain;
The clustering unit 502 is configured to cluster the multiple original documents to generate a first class cluster set;
The screening unit 503 is configured to screen each class cluster in the first class cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate a second class cluster set that includes the class clusters associated with the target domain;
The updating unit 504 is configured to update the preset word list and the word relation list according to the second class cluster set.
The workflow of this device embodiment is as follows. First, the acquiring unit 501 obtains multiple original documents related to the target domain. Second, the clustering unit 502 clusters the multiple original documents to generate a first class cluster set. Next, the screening unit 503 screens each class cluster in the first class cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate a second class cluster set that includes the class clusters associated with the target domain. Finally, the updating unit 504 updates the preset word list and the word relation list according to the second class cluster set.
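The way the four units hand data to one another can be sketched as a simple pipeline. This is only a structural illustration: the class name `ConstructionDevice`, the representation of each unit as a callable, and the stub behaviours used in the demonstration are all assumptions and say nothing about how each unit is actually realized.

```python
class ConstructionDevice:
    """Chains acquiring, clustering, screening and updating units in order."""

    def __init__(self, acquire, cluster, screen, update):
        self.acquire = acquire    # role of acquiring unit 501
        self.cluster = cluster    # role of clustering unit 502
        self.screen = screen      # role of screening unit 503
        self.update = update      # role of updating unit 504

    def run(self, target_domain, preset_words, word_relations):
        documents = self.acquire(target_domain)
        first_set = self.cluster(documents)
        second_set = self.screen(first_set, preset_words, word_relations)
        return self.update(second_set, preset_words, word_relations)


# Stub units with hypothetical behaviour, used only to show the data flow.
device = ConstructionDevice(
    acquire=lambda domain: ["doc1", "doc2", "doc3"],
    cluster=lambda docs: {"A": docs[:2], "B": docs[2:]},
    screen=lambda clusters, words, relations: {"A": clusters["A"]},
    update=lambda clusters, words, relations: (words + ["memory"], relations),
)
new_words, new_relations = device.run("computer", ["computer"], {})
```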
The beneficial effects obtained with this device embodiment are as follows:
In this embodiment of the present invention, after the multiple original documents related to the target domain are obtained, they are automatically clustered and screened to obtain the class clusters related to the target domain, and the preset word list and word relation list of the target domain are automatically updated using the screened class clusters. Compared with methods of constructing a knowledge organization system in the prior art, this embodiment does not need a large amount of manpower to classify and screen the very large number of original documents in the target domain or to update the preset word list and word relation list of the target domain, which greatly saves resources and improves the efficiency of constructing the knowledge organization system. Errors caused by human factors are also reduced, so the constructed knowledge organization system is more accurate.
This device embodiment can construct the knowledge organization system in many ways. For example, in a first implementation, the screening unit 503 is specifically configured to:
Determine, according to the preset word list and the word relation list, the number of times the words included in the preset word list and the word relation list occur in each class cluster of the first class cluster set; and screen out of the first class cluster set, according to the determined numbers, the class clusters associated with the target domain to generate the second class cluster set.
In a second implementation, the updating unit 504 is specifically configured to:
Determine the similarity between the words included in the preset word list and the words included in each class cluster of the second class cluster set;
Update the preset word list with the words of the second class cluster set whose determined similarity is less than a predetermined threshold.
In a third implementation, the updating unit 504 is specifically configured to:
Extract target words from each original document in the second class cluster set;
Determine the relation vector of each target word according to its context information;
Calculate, according to the determined relation vectors, the similarity of any two target words in the second class cluster set;
Determine the word relation of any two target words according to which predetermined similarity interval their similarity falls in, and update the preset word relation list according to the word relation.
In a fourth implementation, the construction device further includes a generating unit, which is configured to:
After the second class cluster set including the class clusters associated with the target domain is generated, extract from each original document in the second class cluster set the words that satisfy the preset part of speech and/or word length; deduplicate these words, and generate the term set of the target domain from the deduplicated words;
Where the updating unit 504 is specifically configured to:
Update the preset word list according to the term set of the target domain.
In a fifth implementation, the construction device further includes a determining unit, which is configured to:
Determine the index degree of each original document in the second class cluster set according to the term set of the target domain;
Determine, according to the determined index degree, whether each original document in the second class cluster set is up to standard;
Where the updating unit 504 is specifically configured to:
Update the preset word relation list corresponding to the target domain according to the original documents in the second class cluster set that are up to standard.
In a sixth implementation, the construction device further includes an extracting unit, which is configured to:
Extract the target words included in each original document in the second class cluster set;
Where the determining unit is specifically configured to:
Index each original document in the second class cluster set according to the words contained in the term set;
Determine the index degree of an original document in the second class cluster set according to the lengths of the target words that the original document includes and the lengths of the words indexed in that original document.
In a seventh implementation, the determining unit is specifically configured to:
Determine the number of target words of an original document in the second class cluster set, and the number of those target words that are included in the term set;
Determine the index degree of the original document in the second class cluster set according to the number of target words of the original document and the number of those target words that are included in the term set.
Embodiment 4
Based on the same inventive concept, this embodiment of the present invention also provides a server. A schematic structural diagram of the server is shown in Fig. 6. The server includes a memory 601 and a processor 602; the memory 601 is configured to store information including program instructions, the processor 602 is configured to control execution of the program instructions, and when the program is executed by the processor 602, the steps of any method for constructing a knowledge organization system provided by the embodiments of the present invention are implemented.
Specifically, at least one program stored in the memory 601, when executed by the processor 602, implements the following steps:
Obtain multiple original documents related to the target domain;
Cluster the multiple original documents to generate a first class cluster set;
Screen each class cluster in the first class cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate a second class cluster set that includes the class clusters associated with the target domain;
Update the preset word list and the word relation list according to the second class cluster set.
Preferably, the at least one program is configured to implement:
Determine, according to the preset word list and the word relation list, the number of times the words included in the preset word list and the word relation list occur in each class cluster of the first class cluster set;
Screen out of the first class cluster set, according to the determined numbers, the class clusters associated with the target domain to generate the second class cluster set.
Preferably, the at least one program is configured to implement:
Determine the similarity between the words included in the preset word list and the words included in each class cluster of the second class cluster set;
Update the preset word list with the words of the second class cluster set whose determined similarity is less than a predetermined threshold.
Preferably, the at least one program is configured to implement:
Extract target words from each original document in the second class cluster set;
Determine the relation vector of each target word according to its context information;
Calculate, according to the determined relation vectors, the similarity of any two target words in the second class cluster set;
Determine the word relation of any two target words according to which predetermined similarity interval their similarity falls in, and update the word relation list according to the word relation.
Preferably, the at least one program is used to implement:
Extracting, from each original document in the second cluster set, the words that satisfy a preset part of speech and/or word length;
Deduplicating the words that satisfy the preset part of speech and/or word length, and generating a term set of the target domain from the deduplicated words;
Wherein updating the preset word list according to the second cluster set includes:
Updating the preset word list according to the term set of the target domain.
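For illustration only, the extraction and deduplication above could look like the sketch below. It assumes the documents have already been segmented and tagged as (word, part-of-speech) pairs by some upstream tool, and it applies the part-of-speech and length conditions together (the "and" branch of "and/or"). The tag "n", the length bounds and the name `build_term_set` are hypothetical.

```python
def build_term_set(tagged_docs, allowed_pos=frozenset({"n"}), min_len=2, max_len=12):
    """Collect unique words that match the allowed tags and length bounds.

    tagged_docs -- list of documents, each a list of (word, pos_tag) pairs
    Returns the terms in order of first appearance.
    """
    seen = set()
    terms = []
    for doc in tagged_docs:
        for word, pos in doc:
            if pos in allowed_pos and min_len <= len(word) <= max_len:
                if word not in seen:  # deduplicate while preserving order
                    seen.add(word)
                    terms.append(word)
    return terms


tagged_docs = [
    [("ontology", "n"), ("build", "v"), ("thesaurus", "n")],
    [("ontology", "n"), ("a", "n")],
]
print(build_term_set(tagged_docs))
```

The resulting list plays the role of the term set of the target domain and could then be merged into the preset word list.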
Preferably, the at least one program is used to implement:
Determining, according to the term set of the target domain, the indexing degree of each original document in the second cluster set;
Determining, according to the determined indexing degree, whether each original document in the second cluster set meets the standard;
Wherein updating the preset word relation list according to the second cluster set includes:
Updating the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.
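A minimal sketch of the qualification check above is given below. It assumes that "meeting the standard" means the indexing degree reaches a fixed threshold; the threshold value, the document identifiers and the name `select_qualified` are assumptions for this example.

```python
def select_qualified(index_degrees, threshold=0.5):
    """Return the ids of documents whose indexing degree reaches the threshold.

    index_degrees -- mapping from document id to indexing degree in [0, 1]
    threshold     -- hypothetical standard; the source does not give a value
    """
    return [doc_id for doc_id, degree in index_degrees.items() if degree >= threshold]


index_degrees = {"doc1": 0.8, "doc2": 0.2, "doc3": 0.5}
print(select_qualified(index_degrees))
```

Only the documents returned by such a check would then be used to update the word relation list.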
Preferably, the at least one program is used to implement:
Extracting the target words included in each original document in the second cluster set;
Wherein the step of determining, according to the term set, the indexing degree of each original document in the second cluster set includes:
Indexing each original document in the second cluster set according to the words contained in the term set;
Determining the indexing degree of any original document in the second cluster set according to the length of the target words included in that original document and the length of the indexed words in that original document.
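One possible reading of the length-based indexing degree above is the ratio of the total length of the indexed words to the total length of all target words in the document. The sketch below implements that reading; the ratio itself and the name `index_degree_by_length` are assumptions, since this passage does not state the formula.

```python
def index_degree_by_length(target_words, term_set):
    """Share of target-word characters that belong to words found in the term set."""
    total = sum(len(word) for word in target_words)
    indexed = sum(len(word) for word in target_words if word in term_set)
    return indexed / total if total else 0.0


target_words = ["ontology", "is", "useful"]
term_set = {"ontology"}
print(index_degree_by_length(target_words, term_set))
```

In this example 8 of the 16 characters belong to an indexed word, giving an indexing degree of 0.5.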
Preferably, the at least one program is used to implement:
Determining the number of target words in any original document in the second cluster set, and the number of those target words that are included in the term set;
Determining the indexing degree of that original document in the second cluster set according to the number of its target words and the number of its target words that are included in the term set.
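The count-based variant above can be sketched in the same way, again assuming that the indexing degree is the simple ratio of the two counts; the name `index_degree_by_count` and the sample words are hypothetical.

```python
def index_degree_by_count(target_words, term_set):
    """Fraction of a document's target words that appear in the term set."""
    if not target_words:
        return 0.0
    hits = sum(1 for word in target_words if word in term_set)
    return hits / len(target_words)


target_words = ["ontology", "is", "useful", "thesaurus"]
term_set = {"ontology", "thesaurus"}
print(index_degree_by_count(target_words, term_set))
```

Two of the four target words are in the term set, so the sketch yields 0.5 for this document.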
The beneficial effects obtained with the server provided by the embodiments of the present invention are the same as or similar to those obtained by the foregoing method embodiments or device embodiments, and are not repeated here.
Those skilled in the art will appreciate that the present invention includes devices for performing one or more of the operations described herein. These devices may be specially designed and manufactured for the required purposes, or may include known devices in general-purpose computers. These devices have computer programs stored in them, and these computer programs are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (for example, computer-readable) medium, or in any type of medium that is suitable for storing electronic instructions and is coupled to a bus. The computer-readable medium includes, but is not limited to, any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards or optical cards. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (for example, a computer).
Those skilled in the art will appreciate that computer program instructions can be used to implement each block in these structure diagrams and/or block diagrams and/or flow diagrams, as well as combinations of blocks in these structure diagrams and/or block diagrams and/or flow diagrams. Those skilled in the art will appreciate that these computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer or other programmable data processing method for implementation, so that the schemes specified in the block or blocks of the structure diagrams and/or block diagrams and/or flow diagrams disclosed in the present invention are executed by the processor of the computer or other programmable data processing method.
Those skilled in the art will appreciate that the steps, measures and schemes in the various operations, methods and flows discussed in the present invention may be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art that correspond to those in the various operations, methods and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted.
The above are only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make further improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (17)
- 1. A method for constructing a knowledge organization system, characterized by comprising: obtaining a plurality of original documents related to a target domain; clustering the plurality of original documents to generate a first cluster set; screening each cluster in the first cluster set according to a preset word list and word relation list corresponding to the target domain, to generate a second cluster set comprising the clusters associated with the target domain; updating the preset word list and word relation list according to the second cluster set.
- 2. The construction method according to claim 1, characterized in that the step of screening each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate the second cluster set comprising the clusters associated with the target domain, comprises: determining, according to the preset word list and word relation list, the number of times that the words included in the preset word list and word relation list occur in each cluster of the first cluster set; selecting, according to the determined number of occurrences, the clusters associated with the target domain from the first cluster set to generate the second cluster set.
- 3. The construction method according to claim 1, characterized in that the step of updating the preset word list according to the second cluster set comprises: determining the similarity between the words included in the preset word list and the words included in each cluster of the second cluster set; updating the preset word list according to the words in the second cluster set whose determined similarity is below a predetermined threshold.
- 4. The construction method according to claim 1, characterized in that the step of updating the preset word relation list according to the second cluster set comprises: extracting target words from each original document in the second cluster set; determining, according to the context information of any target word, the relation vector of that target word; calculating, according to the determined relation vectors, the similarity between any two target words in the second cluster set; determining the word relation between any two target words according to the relationship between their similarity and predetermined similarity intervals, and updating the preset word relation list according to the word relation.
- 5. The construction method according to any one of claims 1-2, characterized in that, after the step of generating the second cluster set comprising the clusters associated with the target domain, the method further comprises: extracting, from each original document in the second cluster set, the words that satisfy a preset part of speech and/or word length; deduplicating the words that satisfy the preset part of speech and/or word length, and generating a term set of the target domain from the deduplicated words; wherein updating the preset word list according to the second cluster set comprises: updating the preset word list according to the term set of the target domain.
- 6. The construction method according to claim 5, characterized by further comprising: determining, according to the term set of the target domain, the indexing degree of each original document in the second cluster set; determining, according to the determined indexing degree, whether each original document in the second cluster set meets the standard; wherein updating the preset word relation list according to the second cluster set comprises: updating the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.
- 7. The construction method according to claim 5, characterized in that, before the step of determining, according to the term set, the indexing degree of each original document in the second cluster set, the method further comprises: extracting the target words included in each original document in the second cluster set; wherein the step of determining, according to the term set, the indexing degree of each original document in the second cluster set comprises: indexing each original document in the second cluster set according to the words contained in the term set; determining the indexing degree of any original document in the second cluster set according to the length of the target words included in that original document and the length of the indexed words in that original document.
- 8. The construction method according to claim 7, characterized in that the step of determining, according to the term set, the indexing degree of each original document in the second cluster set comprises: determining the number of target words in any original document in the second cluster set, and the number of those target words that are included in the term set; determining the indexing degree of that original document in the second cluster set according to the number of its target words and the number of its target words that are included in the term set.
- 9. A device for constructing a knowledge organization system, characterized by comprising an acquiring unit, a clustering unit, a screening unit and an updating unit, wherein: the acquiring unit is configured to obtain a plurality of original documents related to a target domain; the clustering unit is configured to cluster the plurality of original documents to generate a first cluster set; the screening unit is configured to screen each cluster in the first cluster set according to a preset word list and word relation list corresponding to the target domain, to generate a second cluster set comprising the clusters associated with the target domain; the updating unit is configured to update the preset word list and word relation list according to the second cluster set.
- 10. The construction device according to claim 9, characterized in that the screening unit is specifically configured to: determine, according to the preset word list and word relation list, the number of times that the words included in the preset word list and word relation list occur in each cluster of the first cluster set; select, according to the determined number of occurrences, the clusters associated with the target domain from the first cluster set to generate the second cluster set.
- 11. The construction device according to claim 9, characterized in that the updating unit is specifically configured to: determine the similarity between the words included in the preset word list and the words included in each cluster of the second cluster set; update the preset word list according to the words in the second cluster set whose determined similarity is below a predetermined threshold.
- 12. The construction device according to claim 9, characterized in that the updating unit is specifically configured to: extract target words from each original document in the second cluster set; determine, according to the context information of any target word, the relation vector of that target word; calculate, according to the determined relation vectors, the similarity between any two target words in the second cluster set; determine the word relation between any two target words according to the relationship between their similarity and predetermined similarity intervals, and update the preset word relation list according to the word relation.
- 13. The construction device according to any one of claims 9-10, characterized by further comprising a generating unit, the generating unit being configured to: after the step of generating the second cluster set comprising the clusters associated with the target domain, extract, from each original document in the second cluster set, the words that satisfy a preset part of speech and/or word length; deduplicate the words that satisfy the preset part of speech and/or word length, and generate a term set of the target domain from the deduplicated words; wherein the updating unit is specifically configured to: update the preset word list according to the term set of the target domain.
- 14. The construction device according to claim 13, characterized by further comprising a determining unit, the determining unit being configured to: determine, according to the term set of the target domain, the indexing degree of each original document in the second cluster set; determine, according to the determined indexing degree, whether each original document in the second cluster set meets the standard; wherein the updating unit is specifically configured to: update the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.
- 15. The construction device according to claim 13, characterized by further comprising an extraction unit, the extraction unit being configured to: extract the target words included in each original document in the second cluster set; wherein the determining unit is specifically configured to: index each original document in the second cluster set according to the words contained in the term set; determine the indexing degree of any original document in the second cluster set according to the length of the target words included in that original document and the length of the indexed words in that original document.
- 16. The construction device according to claim 15, characterized in that the determining unit is specifically configured to: determine the number of target words in any original document in the second cluster set, and the number of those target words that are included in the term set; determine the indexing degree of that original document in the second cluster set according to the number of its target words and the number of its target words that are included in the term set.
- 17. A server, comprising a memory and a processor, the memory being configured to store information including program instructions, and the processor being configured to control the execution of the program instructions, characterized in that, when the program is executed by the processor, the steps of the method according to any one of claims 1-8 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710909279.9A CN107679174A (en) | 2017-09-29 | 2017-09-29 | Construction method, device and the server of Knowledge Organization System |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107679174A (en) | 2018-02-09 |
Family
ID=61138653
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710909279.9A Pending CN107679174A (en) | 2017-09-29 | 2017-09-29 | Construction method, device and the server of Knowledge Organization System |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107679174A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120253793A1 (en) * | 2011-04-01 | 2012-10-04 | Rima Ghannam | System for natural language understanding |
CN103049524A (en) * | 2012-12-20 | 2013-04-17 | 中国科学技术信息研究所 | Method for automatically clustering synonym search results according to lexical meanings |
CN105095204A (en) * | 2014-04-17 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Method and device for obtaining synonym |
CN106156090A (en) * | 2015-04-01 | 2016-11-23 | 上海宽文是风软件有限公司 | A kind of designing for manufacturing knowledge personalized push method of knowledge based collection of illustrative plates (Man-tree) |
CN106897380A (en) * | 2017-01-20 | 2017-06-27 | 浙江大学 | The self adaptation demand model construction method that a kind of Design-Oriented knowledge is dynamically pushed |
Non-Patent Citations (4)
Title |
---|
周国臻: "《网络环境下叙词表编制与发展》", 30 April 2015, 科学技术文献出版社 * |
张运良等: "知识组织系统自适应构建关键问题研究", 《情报工程》 * |
曾民族: "《知识技术及其应用》", 30 November 2005, 科学技术文献出版社 * |
陆勇: "《面向信息检索的汉语同义词自动识别》", 31 December 2009, 《东南大学出版社》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145084A (en) * | 2018-07-10 | 2019-01-04 | 阿里巴巴集团控股有限公司 | Data processing method, data processing equipment and server |
CN109145084B (en) * | 2018-07-10 | 2022-07-01 | 创新先进技术有限公司 | Data processing method, data processing device and server |
CN114021200A (en) * | 2022-01-07 | 2022-02-08 | 每日互动股份有限公司 | Data processing system for pkg fuzzification |
CN114638222A (en) * | 2022-05-17 | 2022-06-17 | 天津卓朗科技发展有限公司 | Natural disaster data classification method and model training method and device thereof |
CN117725229A (en) * | 2024-01-08 | 2024-03-19 | 中国科学技术信息研究所 | Knowledge organization system auxiliary updating method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20180209 |