CN107679174A

CN107679174A - Construction method, device and server for a knowledge organization system

Info

Publication number: CN107679174A
Application number: CN201710909279.9A
Authority: CN
Inventors: 张运良; 侯慧敏; 姚长青
Original assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Current assignee: INSTITUTE OF SCIENCE AND TECHNOLOGY INFORMATION OF CHINA
Priority date: 2017-09-29
Filing date: 2017-09-29
Publication date: 2018-02-09

Abstract

The invention provides a construction method, a device and a server for a knowledge organization system. The method includes: obtaining multiple original documents related to a target domain; clustering the multiple original documents to generate a first cluster set; screening each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain; and updating the preset word list and word relation list according to the second cluster set. Compared with prior-art methods of building a knowledge organization system, the embodiments of the invention do not need a large amount of manpower to classify and screen the large number of original documents in the target domain or to update the preset word list and word relation list of the target domain. This saves considerable resources and improves the efficiency of building the knowledge organization system; it also reduces errors caused by human factors, so that the resulting knowledge organization system is more accurate.

Description

Construction method, device and server for a knowledge organization system

Technical field

The present invention relates to the field of computer technology, and in particular to a construction method, a device and a server for a knowledge organization system.

Background

A knowledge organization system (KOS) is a system that can be understood, read and recognized by computers. It mainly comprises various vocabularies, such as descriptor and subject-term vocabularies, and is the infrastructure of current retrieval systems.

In the prior art, most of the work of building a knowledge organization system requires human participation in order to make the resulting system accurate. For example, when a large number of words (for example, keywords and subject terms) have been collected in some field, people (for example, experts in that field) must classify and associate these words. Moreover, as information keeps expanding, a large amount of manpower is needed to update the knowledge organization system from time to time, for example to update its various vocabularies; this work also relies mainly on manual review and screening to add suitable words to the corresponding vocabulary.

It follows that existing methods need a large amount of manpower to update and maintain a knowledge organization system. In practice, therefore, to save resources, the responsible departments usually do not set a short cycle for updating and maintaining the system, which inconveniences its users; for example, a user may be unable to retrieve relevant material in time. In addition, because such mainly manual construction involves a heavy workload, human error is unavoidable, which makes the resulting knowledge organization system inaccurate.

Summary of the invention

In view of the foregoing, the invention provides a construction method, a device and a server for a knowledge organization system that reduce the user's workload when building a knowledge organization system. Compared with prior-art methods of building a knowledge organization system, the construction method provided by the invention is more automated and improves the efficiency of building the system.

An embodiment of the invention provides a construction method for a knowledge organization system, including:

obtaining multiple original documents related to a target domain;

clustering the multiple original documents to generate a first cluster set;

screening each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain;

updating the preset word list and word relation list according to the second cluster set.

Preferably, the step of screening each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate the second cluster set that includes the clusters associated with the target domain, includes:

determining, according to the preset word list and word relation list, the number of times the words included in the preset word list and word relation list occur in each cluster of the first cluster set;

selecting, according to the determined numbers of occurrences, the clusters associated with the target domain from the first cluster set to generate the second cluster set.

Preferably, the step of updating the preset word list according to the second cluster set includes:

determining the similarity between the words included in the preset word list and the words included in each cluster of the second cluster set;

updating the preset word list with the words in the second cluster set whose determined similarity is below a predetermined threshold.

Preferably, the step of updating the preset word relation list according to the second cluster set includes:

extracting target words from each original document in the second cluster set;

determining the relation vector of any target word according to the contextual information of that target word;

calculating, according to the determined relation vectors, the similarity of any two target words in the second cluster set;

determining the word relation of any two target words according to how their similarity falls within predetermined similarity intervals, and updating the preset word relation list according to the word relation.

Preferably, after the step of generating the second cluster set that includes the clusters associated with the target domain, the method further includes:

extracting, from each original document in the second cluster set, the words that meet a preset part of speech and/or word length;

deduplicating the words that meet the preset part of speech and/or word length, and generating the term set of the target domain from the deduplicated words;

wherein updating the preset word list according to the second cluster set includes:

updating the preset word list according to the term set of the target domain.

Preferably, the construction method further includes:

determining, according to the term set of the target domain, the indexing degree of each original document in the second cluster set;

determining, according to the determined indexing degree, whether each original document in the second cluster set meets the standard;

wherein updating the preset word relation list according to the second cluster set includes:

updating the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.

Preferably, before the step of determining the indexing degree of each original document in the second cluster set according to the term set, the method further includes:

extracting the target words included in each original document in the second cluster set;

wherein the step of determining the indexing degree of each original document in the second cluster set according to the term set includes:

indexing each original document in the second cluster set according to the words contained in the term set;

determining the indexing degree of any original document in the second cluster set according to the length of the target words included in that document and the length of the index words in that document.

Preferably, the step of determining the indexing degree of each original document in the second cluster set according to the term set includes:

determining the number of target words of any original document in the second cluster set, and the number of those target words that are included in the term set;

determining the indexing degree of that original document according to the number of its target words and the number of its target words that are included in the term set.
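The count-based variant of the indexing degree described above can be pictured with a minimal Python sketch. The text only says that the degree is determined from the two counts, so the plain ratio used here, the function name `indexing_degree`, the sample words and the pass threshold `MIN_DEGREE` are all assumptions made for illustration.

```python
def indexing_degree(target_words, term_set):
    """Share of a document's target words that also appear in the term set.

    Using a simple ratio is an assumption; the text does not fix the formula.
    """
    if not target_words:
        return 0.0
    hits = sum(1 for word in target_words if word in term_set)
    return hits / len(target_words)

# Hypothetical term set and target words of one original document.
term_set = {"neural network", "deep learning", "K-means"}
target_words = ["neural network", "deep learning", "stock price", "K-means"]

MIN_DEGREE = 0.5  # hypothetical threshold for "meets the standard"
degree = indexing_degree(target_words, term_set)
meets_standard = degree >= MIN_DEGREE
print(degree, meets_standard)
```

Under these assumptions, only documents for which `meets_standard` is true would be used when updating the word relation list.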

An embodiment of the invention also provides a construction device for a knowledge organization system, including:

an acquiring unit, a clustering unit, a screening unit and an updating unit, wherein:

the acquiring unit is used to obtain multiple original documents related to a target domain;

the clustering unit is used to cluster the multiple original documents to generate a first cluster set;

the screening unit is used to screen each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain;

the updating unit is used to update the preset word list and word relation list according to the second cluster set.

Preferably, the screening unit is specifically used to:

determine, according to the preset word list and word relation list, the number of times the words included in the preset word list and word relation list occur in each cluster of the first cluster set; and select, according to the determined numbers of occurrences, the clusters associated with the target domain from the first cluster set to generate the second cluster set.

Preferably, the updating unit is specifically used to:

Preferably, the device further includes a generating unit, which is used to:

after the step of generating the second cluster set that includes the clusters associated with the target domain, extract from each original document in the second cluster set the words that meet a preset part of speech and/or word length; deduplicate the words that meet the preset part of speech and/or word length, and generate the term set of the target domain from the deduplicated words;

wherein the updating unit is specifically used to:

update the preset word list according to the term set of the target domain.

Preferably, the device further includes a determining unit, which is used to:

wherein the updating unit is specifically used to:

Preferably, the construction device further includes an extracting unit, which is used to:

wherein the determining unit is specifically used to:

Preferably, the determining unit is specifically used to:

An embodiment of the invention also provides a server including a memory and a processor. The memory is used to store information including program instructions, and the processor is used to control the execution of the program instructions; when the program is executed by the processor, the steps of any construction method for a knowledge organization system provided by the embodiments of the invention are carried out.

The beneficial effects obtained with the embodiments of the invention are as follows:

In the embodiments of the invention, after multiple original documents related to the target domain are obtained, they are clustered to obtain multiple clusters of the target domain (the first cluster set). These clusters are then screened according to the preset word list and word relation list corresponding to the target domain, removing clusters that are unrelated or only weakly related to the target domain, so that the set of clusters associated with the target domain remains (the second cluster set). The preset word list and word relation list are then updated according to the second cluster set. With the embodiments of the invention, once the original documents related to the target domain have been obtained, they are clustered and screened automatically to obtain the clusters related to the target domain, and the screened clusters are used to update the preset word list and word relation list of the target domain automatically. Compared with prior-art methods of building a knowledge organization system, the embodiments of the invention do not need a large amount of manpower to classify and screen the large number of original documents in the target domain or to update the preset word list and word relation list of the target domain. This saves considerable resources and improves the efficiency of building the knowledge organization system; it also reduces errors caused by human factors, so that the resulting knowledge organization system is more accurate.

Additional aspects and advantages of the invention will be set forth in part in the following description, and in part will become apparent from that description or be learned through practice of the invention.

Brief description of the drawings

The above and/or additional aspects and advantages of the invention will become apparent and readily understood from the following description of the embodiments with reference to the accompanying drawings, in which:

Fig. 1 is a schematic flow chart of a construction method for a knowledge organization system provided by Embodiment 1 of the invention;

Fig. 2 is a schematic diagram of updating the preset word list according to the second cluster set, provided by Embodiment 1 of the invention;

Fig. 3 is a schematic diagram of updating the preset word relation list according to the second cluster set, provided by Embodiment 1 of the invention;

Fig. 4 is a schematic flow chart of another construction method for a knowledge organization system provided by Embodiment 2 of the invention;

Fig. 5 is a schematic structural diagram of a construction device for a knowledge organization system of Embodiment 3 of the invention;

Fig. 6 is a schematic structural diagram of a server of Embodiment 4 of the invention.

Detailed description

Embodiments of the invention are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the invention and are not to be construed as limiting the claims.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the word "comprising" used in this specification refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. In addition, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes all or any of the units and all combinations of one or more of the associated listed items.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the invention belongs. It should also be understood that terms such as those defined in general dictionaries are to be understood as having a meaning consistent with their meaning in the context of the prior art and, unless specifically defined as herein, are not to be interpreted in an idealized or overly formal sense.

The technical solutions of the embodiments of the invention are described in detail below with reference to the accompanying drawings.

Embodiment 1

An embodiment of the invention provides a construction method for a knowledge organization system. The flow of the method is shown in Fig. 1 and specifically includes the following steps:

S101: Obtain multiple original documents related to the target domain.

S102: Cluster the multiple original documents to generate a first cluster set.

S103: Screen each cluster in the first cluster set according to the preset word list and word relation list corresponding to the target domain, to generate a second cluster set that includes the clusters associated with the target domain.

S104: Update the preset word list and word relation list according to the second cluster set.

With this embodiment of the invention, once the original documents related to the target domain have been obtained, they are clustered and screened automatically to obtain the clusters related to the target domain, and the screened clusters are used to update the preset word list and word relation list of the target domain automatically. Compared with prior-art methods of building a knowledge organization system, the embodiment does not need a large amount of manpower to classify and screen the large number of original documents in the target domain or to update the preset word list and word relation list of the target domain. This saves considerable resources and improves the efficiency of building the knowledge organization system; it also reduces errors caused by human factors, so that the resulting knowledge organization system is more accurate.

The specific implementation of each of the above steps is described further below:

S101: Obtain multiple original documents related to the target domain.

In this step, multiple original documents related to the target domain are obtained; there are many ways to do so. For example, suppose the original documents to be obtained are the literature collection of the target domain. In one embodiment, the literature collection can be obtained from a standard literature database: according to a preset search strategy, the relevant literature collection of the target domain is retrieved from the database. The preset search strategy may include preset search terms for the target domain; for example, if the target domain is "computer", the corresponding search terms may be "computer", "server", and so on. After the literature collection of the target domain is obtained, metadata including the title, keywords, abstract and first paragraph of the body text are extracted from each document in the collection; according to the extracted metadata, a target literature collection that better matches the user's needs is filtered out of the literature collection.

In another embodiment, a large number of target documents can be captured from the literature databases of the corresponding websites by means of crawler technology. These are only two exemplary ways of obtaining the original documents of the target domain; many other ways exist in practice, and the embodiments of the invention do not specifically limit this.

In step S102, the obtained original documents of the target domain are clustered to generate the first cluster set. Methods for clustering the original documents include K-means (K-means clustering), DBSCAN (density-based clustering), BIRCH (hierarchical clustering) and so on. The clustering method can be chosen according to the properties of the obtained original documents; for example, when the user has specified the number of categories into which the original documents are to be divided, K-means can be selected.
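As a rough illustration of the clustering step, the following Python sketch groups four toy documents with a minimal K-means on term-frequency vectors. The whitespace tokenization, the use of the first k documents as initial centroids, the sample documents and the names `vectorize` and `kmeans` are assumptions made for illustration; a practical system would more likely rely on an established clustering library.

```python
from collections import Counter

def vectorize(docs):
    """Turn each document into a term-frequency vector over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vectors.append([float(counts[w]) for w in vocab])
    return vectors

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, max_iter=100):
    """Minimal K-means; initial centroids are the first k vectors (an assumption)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assign every vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

docs = [
    "machine learning neural network",
    "stock market price",
    "neural network deep learning",
    "market price trading stock",
]
labels = kmeans(vectorize(docs), k=2)
print(labels)
```

Documents that share a label form one cluster of the first cluster set; the number of clusters k is the quantity the user is assumed to have specified.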

In step S103, each cluster in the obtained first cluster set is screened to generate the second cluster set. Specifically, each cluster in the first cluster set is screened according to the preset word list and word relation list corresponding to the target domain, and clusters that are not associated, or only weakly associated, with the target domain are removed, which yields the second cluster set; that is, the second cluster set includes the clusters that are associated, or strongly associated, with the target domain.

The preset word list contains the individual words that belong to the target domain. Table 1 shows the preset word list of the "machine learning" domain, in which words such as "neural network", "deep learning", "machine translation" and "K-means" belong to the "machine learning" domain. Table 2 shows the preset word relation list of the "machine learning" domain. The preset relation categories in the word relation list include "identity relation", "hyponymy relation", "coordinate relation" and so on; for example, "K-means" and "K-means clustering" are in an identity relation, and "neural network" and "deep learning" are in a hyponymy relation.

Table 1

In practice, the number of words contained in the word list of the target domain is limited, and the word relation list can supplement the words in the word list. For example, in the word relation list of Table 2, "K-means" and "K-means clustering" are in an identity relation. Suppose the word list contains only "K-means". When the clusters in the first cluster set are screened and "K-means clustering" occurs in a cluster being screened, the word relation list shows that "K-means clustering" is equivalent to "K-means" and therefore also belongs to the "machine learning" domain.

Table 2

Relation category	Word 1	Word 2
Identity relation	K-means	K-means clustering
Hyponymy relation	Neural network	Deep learning
Coordinate relation	K-means	DBSCAN
……	……	……
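One way to picture how the word relation list supplements the word list is a small expansion step that adds every word linked to a listed word by an identity relation. The following Python sketch is illustrative only: the tuple layout of the relation list, the relation label strings and the function name `expand_with_identity` are assumptions, not part of the embodiment.

```python
def expand_with_identity(word_list, relation_list):
    """Return the word list plus every word tied to it by an identity relation."""
    expanded = set(word_list)
    changed = True
    while changed:  # repeat so that chains of identity relations are followed
        changed = False
        for relation, w1, w2 in relation_list:
            if relation != "identity":
                continue
            if w1 in expanded and w2 not in expanded:
                expanded.add(w2)
                changed = True
            elif w2 in expanded and w1 not in expanded:
                expanded.add(w1)
                changed = True
    return expanded

word_list = ["K-means"]
relation_list = [
    ("identity", "K-means", "K-means clustering"),
    ("hyponymy", "neural network", "deep learning"),
    ("coordinate", "K-means", "DBSCAN"),
]
screening_vocabulary = expand_with_identity(word_list, relation_list)
print(sorted(screening_vocabulary))
```

Only the identity relation is followed here; whether hyponymy or coordinate relations should also widen the screening vocabulary is left open by the text.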

Specifically, the clusters in the first cluster set are screened as follows: according to the preset word list and word relation list, the number of times the words included in the preset word list and word relation list occur in each cluster of the first cluster set is determined; according to the determined numbers of occurrences, the clusters associated with the target domain are selected from the first cluster set to generate the second cluster set.

As shown in Table 3, "machine learning", "neural network", "deep learning" and "K-means" are words included in the preset word list and word relation list of the "machine learning" domain. Suppose that in cluster A of the first cluster set "machine learning" occurs 500 times, "neural network" 180 times, "deep learning" 50 times and "K-means" 20 times, for a total of 750 occurrences. In cluster B of the first cluster set, "machine learning" occurs 20 times, "neural network" 20 times, "deep learning" 8 times and "K-means" 15 times, for a total of 63 occurrences. Comparing the numbers of occurrences of these words in cluster A and cluster B shows that cluster A is more closely associated with the target domain.

In a preferred embodiment, a preset number of occurrences can be set. For Table 3, if the preset number is 500, the words included in the preset word list and word relation list occur 750 times in cluster A (more than 500), so cluster A is determined to be a cluster that is associated, or strongly associated, with the target domain; they occur 63 times in cluster B (fewer than 500), so cluster B is determined to be a cluster that is not associated, or only weakly associated, with the target domain.

Table 3

Word	Cluster A (occurrences)	Cluster B (occurrences)
Machine learning	500	20
Neural network	180	20
Deep learning	50	8
K-means	20	15
Total (occurrences)	750	63
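The count-based screening described above can be sketched in a few lines of Python using the figures of Table 3. The dictionary layout and the function name `screen_by_count` are illustrative assumptions, and the text does not say how a total exactly equal to the preset number is treated; the sketch keeps only totals strictly above it.

```python
PRESET_TIMES = 500  # preset number of occurrences from the example

# Occurrences of preset words in each cluster (values from Table 3).
cluster_counts = {
    "A": {"machine learning": 500, "neural network": 180, "deep learning": 50, "K-means": 20},
    "B": {"machine learning": 20, "neural network": 20, "deep learning": 8, "K-means": 15},
}

def screen_by_count(cluster_counts, preset_times):
    """Keep clusters whose total preset-word occurrences exceed the preset number."""
    totals = {name: sum(counts.values()) for name, counts in cluster_counts.items()}
    kept = [name for name, total in totals.items() if total > preset_times]
    return totals, kept

totals, second_cluster_set = screen_by_count(cluster_counts, PRESET_TIMES)
print(totals, second_cluster_set)
```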

Directly comparing the number of occurrences of the words included in the preset word list and word relation list with a preset number, as above, is a relatively simple screening method given by the embodiment of the invention. In practice, there are many screening methods. As shown in Table 4, the words included in the preset word list and word relation list of the "machine learning" domain occur 750 times in cluster A and 538 times in cluster B; if the preset number is still 500, the above screening method determines that both cluster A and cluster B are associated with the "machine learning" domain. However, "K-means" occurs 500 times in cluster B and accounts for a large proportion of the total, whereas "machine learning", "neural network" and the like account for only a small proportion. Cluster B is therefore likely to be strongly associated with a "K-means" domain and only weakly associated with the "machine learning" domain.

For the situation in Table 4, the embodiment of the invention provides a preferred screening method: a coefficient is set for each word included in the preset word list and word relation list, and the coefficient represents how strongly that word indicates that a cluster is associated with the target domain. The clusters associated with the target domain are then selected from the first cluster set, according to the number of times the words included in the preset word list and word relation list occur in each cluster and the preset coefficient of each word, to generate the second cluster set. For Table 4, suppose the coefficient of "machine learning" is set to 1, that of "neural network" to 0.6, that of "deep learning" to 0.5 and that of "K-means" to 0.1. The number of occurrences in cluster A is then converted to 500×1 + 180×0.6 + 50×0.5 + 20×0.1 = 635, and the number of occurrences in cluster B is converted to 10×1 + 20×0.6 + 8×0.5 + 500×0.1 = 76. If the preset number is still 500, cluster A is still determined to be a cluster that is associated, or strongly associated, with the target domain, while cluster B is determined to be a cluster that is not associated, or only weakly associated, with the target domain.

Table 4

Word	Cluster A (occurrences)	Cluster B (occurrences)
Machine learning	500	10
Neural network	180	20
Deep learning	50	8
K-means	20	500
Total (occurrences)	750	538
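The coefficient-weighted screening can be checked with a short Python sketch that uses the counts of Table 4 and the coefficients given in the example. The data layout and the function name `weighted_score` are illustrative assumptions; scores are rounded only to hide floating-point noise.

```python
PRESET_TIMES = 500  # same preset number as in the unweighted example

# Coefficient of each preset word (values from the example).
coefficients = {
    "machine learning": 1.0,
    "neural network": 0.6,
    "deep learning": 0.5,
    "K-means": 0.1,
}

# Occurrences of preset words in each cluster (values from Table 4).
cluster_counts = {
    "A": {"machine learning": 500, "neural network": 180, "deep learning": 50, "K-means": 20},
    "B": {"machine learning": 10, "neural network": 20, "deep learning": 8, "K-means": 500},
}

def weighted_score(counts, coefficients):
    """Sum of occurrences, each scaled by the coefficient of its word."""
    return sum(count * coefficients[word] for word, count in counts.items())

scores = {
    name: round(weighted_score(counts, coefficients), 6)
    for name, counts in cluster_counts.items()
}
second_cluster_set = [name for name, score in scores.items() if score > PRESET_TIMES]
print(scores, second_cluster_set)
```

With the weights applied, cluster B drops well below the preset number even though its raw total of 538 would have passed the unweighted check.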

In practice, the coefficients set for the words included in the preset word list and word relation list can be determined by analyzing a corpus screened in advance, or by other methods of determining coefficients; the embodiments of the invention do not specifically limit this.

In practice, each cluster may contain many (for example, thousands or tens of thousands of) words from the preset word list and word relation list. In a preferred embodiment, each cluster can be represented with a vector space model (VSM). According to the weight of each feature in the vector of each cluster, the clusters associated with the target domain are selected from the first cluster set to generate the second cluster set; each weight represents how strongly the corresponding word indicates that a cluster is associated with the target domain. For example, cluster C is represented as the space vector $k_1 a_1 + k_2 a_2 + k_3 a_3 + \dots + k_n a_n$, where $a_1, a_2, a_3, \dots, a_n$ are the words of the preset word list and word relation list contained in cluster C (that is, the features), and $k_1, k_2, k_3, \dots, k_n$ are the weights of $a_1, a_2, a_3, \dots, a_n$ respectively.

Preferably, the weights of the features in the vector of each cluster are summed to obtain the degree of association between that cluster and the target domain. For example, the vector of cluster D is $a_1 + 2a_2 + 3a_3 + 6a_5$, and summing its weights gives 1 + 2 + 3 + 6 = 12; the vector of cluster E is $2a_1 + 5a_2 + 3a_3 + a_4 + a_6$, and summing its weights gives 2 + 5 + 3 + 1 + 1 = 12. A preset threshold can also be set, and each cluster in the first cluster set is then screened according to the obtained sum and the preset threshold.

The ways of screening clusters listed above are exemplary, and the embodiments of the invention do not specifically limit this.

In step S104, the preset word list and word relation list are updated according to the second cluster set obtained by screening. The method of updating the preset word list specifically includes: determining the similarity between the words included in the preset word list and the words included in each cluster of the second cluster set; and updating the preset word list with the words in the second cluster set whose determined similarity is below a predetermined threshold.

Specifically, according to a vector space model trained in advance, the words in the preset word list and the words included in each cluster of the second cluster set are represented as vectors. According to these vectors, the similarity between any word in the preset word list and any word in the second cluster set is determined, and the words whose similarity is below the predetermined threshold are used to update the preset word list. There are many ways to calculate the similarity, for example the cosine of the angle between two vectors or the Euclidean distance.

As shown in Fig. 2, the similarity between the words in the preset word list and the words in cluster F is determined. If the calculation shows that the similarities between "encoding" and "memory" and all words in the preset word list are below the predetermined threshold, these two words are not yet recorded in the preset word list. "Encoding" and "memory" are then added to the preset word list, which completes the update of the preset word list.
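The word-list update can be sketched as follows, using the cosine of the angle between two vectors as the similarity. The three-dimensional vectors, the threshold of 0.9 and the function names are hypothetical; in practice the vectors would come from the pre-trained vector space model.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def words_to_add(preset: List[str], candidates: List[str],
                 vectors: Dict[str, Sequence[float]], threshold: float) -> List[str]:
    """Return the candidates whose similarity to every preset word is below the threshold."""
    return [w for w in candidates
            if all(cosine(vectors[w], vectors[p]) < threshold for p in preset)]


# Hypothetical toy vectors standing in for the trained vector space model
vectors = {
    "computer": (1.0, 0.0, 0.0),
    "server": (0.9, 0.1, 0.0),
    "encoding": (0.0, 1.0, 0.0),
    "memory": (0.0, 0.0, 1.0),
    "computers": (0.99, 0.05, 0.0),
}
preset_words = ["computer", "server"]
new_words = words_to_add(preset_words, ["encoding", "memory", "computers"], vectors, 0.9)
preset_words.extend(new_words)
```

Here "computers" is too similar to the existing entry "computer" and is skipped, while "encoding" and "memory" are appended to the preset word list.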

The method of updating the preset word relation list provided by the embodiment of the present invention specifically includes: extracting target words from each original document in the second cluster set; determining the relation vector of any target word according to the context information of that target word; calculating the similarity of any two target words in the second cluster set according to the determined relation vectors; and determining the word relation of any two target words according to the predetermined similarity interval into which their similarity falls, and updating the preset word relation list according to that word relation.

Specifically, determining the relation vector of a target word according to its context information includes: extracting a predetermined number of words from the preceding context and from the following context of the target word respectively; and determining the relation vector of the target word according to the extracted words and a preset vector space model. As shown in Fig. 3, suppose the target word is "machine learning", and a passage containing the target word extracted from an original document in the second cluster set reads: "In the field of computer-aided diagnosis, machine learning is widely applied to help medical experts obtain prior knowledge from diagnosed cases". Three words are extracted from the preceding context and three from the following context of "machine learning", namely "computer", "aided", "diagnosis field", "application", "medicine" and "prior knowledge". According to these 6 extracted words, the relation vector of "machine learning" is determined to be "b=k₁x+k₂y+k₃z+k₄p+k₅q+k₆r", where x, y, z, p, q and r respectively represent the 6 extracted words.

The above extracts only one passage containing the target word from one original document. In practice, the target word may appear in many places in each original document, and the second cluster set contains multiple original documents, so the whole second cluster set may contain a large number of passages containing the target word. In addition, the words extracted from the context of the target word differ from passage to passage. Therefore, the relation vector corresponding to any target word in the second cluster set will be a high-dimensional vector (for example, with thousands or tens of thousands of dimensions).

A simple way to determine the weight of each feature in the relation vector of a target word is to use the number of times the feature occurs in the second cluster set directly as its weight. Continuing the above example, the relation vector of "machine learning" is "b=k₁x+k₂y+k₃z+k₄p+k₅q+k₆r"; if "computer" occurs 6 times in the second cluster set, k₁ is 6; if "medicine" occurs 2 times in the second cluster set, k₅ is 2; and so on. This way of determining the weights is again merely illustrative; more complex ways of determining the weights exist in practice, and the present invention is not specifically limited in this respect.
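The construction of a relation vector with count-based weights can be sketched as follows. The sketch assumes that the documents have already been segmented into content words (segmentation and stop-word removal are outside its scope); the function name, the window parameter and the toy second document are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def relation_vector(target: str, documents: List[List[str]], window: int = 3) -> Dict[str, int]:
    """Map each context word of `target` to its occurrence count in the whole document set."""
    corpus_counts = Counter(tok for doc in documents for tok in doc)
    features = set()
    for doc in documents:
        for i, tok in enumerate(doc):
            if tok == target:
                features.update(doc[max(0, i - window):i])   # preceding context
                features.update(doc[i + 1:i + 1 + window])   # following context
    features.discard(target)
    return {f: corpus_counts[f] for f in features}


# English renderings of the context words from the example passage
doc1 = ["computer", "aided", "diagnosis field", "machine learning",
        "application", "medicine", "prior knowledge"]
doc2 = ["computer", "network"]  # hypothetical second document without the target word

vec = relation_vector("machine learning", [doc1, doc2])
```

Each key of `vec` plays the role of one of the features x, y, z, p, q, r, and each value plays the role of the corresponding coefficient k.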

After the relation vectors corresponding to all target words in the second cluster set are determined, the similarity of any two target words is calculated according to these relation vectors; the similarity may again be calculated as the cosine of the angle between two vectors, the Euclidean distance, and so on. The word relation of any two target words is then determined according to the calculated similarity and the predetermined similarity intervals. For example, the predetermined similarity intervals are: "0.9~1" indicates that two words are in an equivalence relation, "0.4~0.8" indicates that two words are in a hierarchical (hypernym-hyponym) relation, and "<0.1" indicates that two words are unrelated or only weakly related. Suppose the similarity of two target words calculated from their relation vectors is 0.97; according to the predetermined similarity intervals, the relation between these two target words is determined to be an equivalence relation, and the preset word relation list is updated according to the determined word relation.
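The mapping from a similarity value to a word relation can be sketched as follows, using the example intervals above. The relation labels, the inclusive interval boundaries and the choice to return `None` for values that fall between the stated intervals are assumptions of this sketch.

```python
from typing import Optional


def relation_from_similarity(similarity: float) -> Optional[str]:
    """Map a similarity value to a word relation using the example intervals."""
    if 0.9 <= similarity <= 1.0:
        return "equivalent"
    if 0.4 <= similarity <= 0.8:
        return "hierarchical"
    if similarity < 0.1:
        return "unrelated"
    return None  # the example intervals do not cover this value


relation = relation_from_similarity(0.97)
```

For the similarity 0.97 used in the text, the sketch yields the equivalence relation.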

Specifically, if the relation between two target words calculated from the second cluster set is the same as the relation of these two target words recorded in the preset word relation list, the relation of the two target words in the preset word relation list does not need to be modified; if it is different, the relation of the two target words in the preset word relation list is supplemented; and if the relation between the two target words calculated from the second cluster set is not recorded in the preset word relation list, the relation of the two target words is added to the preset word relation list.
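The three update cases can be sketched as follows. The data layout (an unordered word pair mapped to a set of relations) and the reading of "supplemented" as "record the new relation alongside the existing one" are assumptions of this sketch, not details given in the text.

```python
from typing import Dict, FrozenSet, Set

RelationList = Dict[FrozenSet[str], Set[str]]


def update_relation_list(relations: RelationList, word_a: str, word_b: str,
                         relation: str) -> bool:
    """Record `relation` for the word pair; return True if the list was changed."""
    key = frozenset((word_a, word_b))
    recorded = relations.setdefault(key, set())
    if relation in recorded:
        return False          # same relation already recorded: nothing to modify
    recorded.add(relation)    # pair not recorded yet, or recorded with another relation
    return True


relations: RelationList = {frozenset(("computer", "PC")): {"equivalent"}}
unchanged = update_relation_list(relations, "PC", "computer", "equivalent")
supplemented = update_relation_list(relations, "computer", "PC", "hierarchical")
added = update_relation_list(relations, "machine learning", "clustering", "hierarchical")
```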

The above method of updating the preset word list and the word relation list according to the second cluster set is likewise merely illustrative and is not to be construed as a specific limitation of the present invention.

Embodiment 2

Based on the same inventive concept, the embodiment of the present invention provides another construction method of a knowledge organization system. A schematic flowchart of the method is shown in Fig. 4, and the method specifically includes the following steps:

S401: Obtain multiple original documents related to the target domain.

S402: Cluster the multiple original documents to generate a first cluster set.

S403: Screen each cluster in the first cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate a second cluster set comprising the clusters associated with the target domain.

S404: Extract the words that satisfy a preset part of speech and/or word length from each original document in the second cluster set; remove duplicates from the words that satisfy the preset part of speech and/or word length, and generate the term set of the target domain from the de-duplicated words.

S405: Determine the indexing degree of each original document in the second cluster set according to the term set of the target domain; and determine, according to the determined indexing degree, whether each original document in the second cluster set meets the standard.

S406: Update the preset word list according to the term set of the target domain, and update the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.

Steps S401 to S403 are implemented in the same or a similar way as S101 to S103 in Embodiment 1. To avoid repetition, S401 to S403 are not described in detail again, and S404 to S406 are described below.

For S404, the words that satisfy a preset part of speech and/or word length are extracted from each original document in the second cluster set, where the preset part of speech specifically includes nouns, verbs and the like. Duplicates are removed from the words that satisfy the preset part of speech and/or word length, and the term set of the target domain is generated from the de-duplicated words.

In practice, the words satisfying the preset part of speech and/or word length that are extracted from each original document do not include common modal particles, personal pronouns and the like, for example "we", "what" and so on. The generated term set of the target domain contains the words that contribute to identifying the target domain. For example, the term set of the "computer field" includes words such as "computer", "server" and "memory"; as another example, the term set of the "machine learning field" includes words such as "machine learning", "neural network" and "clustering algorithm".
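The generation of the term set in S404 can be sketched as follows. The sketch assumes that part-of-speech tagging has already been performed by some external tool, so each document is a list of (word, part of speech) pairs; the tag names, the minimum length and the use of "and" rather than "or" when combining the two conditions are hypothetical choices.

```python
from typing import FrozenSet, Iterable, List, Tuple


def build_term_set(tagged_docs: Iterable[List[Tuple[str, str]]],
                   allowed_pos: FrozenSet[str] = frozenset({"noun", "verb"}),
                   min_length: int = 2) -> List[str]:
    """Collect words with an allowed part of speech and sufficient length, without duplicates."""
    seen = set()
    terms = []
    for doc in tagged_docs:
        for word, pos in doc:
            if pos in allowed_pos and len(word) >= min_length and word not in seen:
                seen.add(word)
                terms.append(word)  # first-seen order is preserved
    return terms


tagged = [
    [("we", "pronoun"), ("computer", "noun"), ("server", "noun")],
    [("computer", "noun"), ("memory", "noun"), ("what", "pronoun")],
]
term_set = build_term_set(tagged)
```

Pronouns such as "we" and "what" are filtered out, and the repeated word "computer" is kept only once.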

In a preferred implementation, after the term set of the target domain is obtained, the preset word list is updated directly according to that term set (S406). The specific update method is: determining the similarity between the words in the term set and the words in the preset word list; and updating the preset word list according to the words in the term set whose determined similarity is below a predetermined threshold. This method of updating the preset word list using the term set is similar to the method in Embodiment 1 of updating the preset word list according to the second cluster set (S104), and is not repeated here.

For S405, the indexing degree of each original document in the second cluster set is determined according to the term set of the target domain. The indexing degree is determined as follows: first, the target words included in each original document in the second cluster set are extracted; each original document in the second cluster set is then indexed according to the words contained in the term set; and the indexing degree of any original document in the second cluster set is determined according to the lengths of the target words included in that original document and the lengths of the indexed words in that original document.

For example, m target words are extracted from a document in total. The document is indexed according to the term set of the domain to which it corresponds, and n words in the document are found to be indexed. The n indexed words include repeated words; for example, if "computer" occurs 8 times in the document, n is 8. Writing lᵢ for the length of the i-th indexed word and Lⱼ for the length of the j-th target word, the indexing degree of the document is:

$$\text{indexing degree} = \frac{\sum_{i=1}^{n} l_i}{\sum_{j=1}^{m} L_j}$$

Specifically, the indexing degree of the document is the ratio of the sum of the lengths of the indexed words in the document to the sum of the lengths of the target words in the document. To illustrate the method of determining the indexing degree more clearly, a specific example is given below. As shown in Table 5, suppose the indexed words of a document are "machine learning", "neural network" and "clustering"; Table 5 records the length of each indexed word and the number of times it occurs in the document. Suppose the document contains 500 target words, and let Lⱼ denote the length of the j-th target word; the indexing degree of the document is then

$$\frac{10 \times 4 + 5 \times 4 + 3 \times 2}{\sum_{j=1}^{500} L_j} = \frac{66}{\sum_{j=1}^{500} L_j}$$

Table 5

Indexed word	Occurrences	Length (characters of the original Chinese term)
Machine learning	10	4
Neural network	5	4
Clustering	3	2
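The length-based indexing degree can be sketched with the values of Table 5 as follows. The total length of the target words is not stated in the text, so the value 1320 used below is purely hypothetical; the function names are likewise illustrative.

```python
from typing import List, Tuple


def indexed_length(indexed_words: List[Tuple[int, int]]) -> int:
    """Sum of length * occurrences over the indexed words, given as (length, occurrences)."""
    return sum(length * count for length, count in indexed_words)


def indexing_degree_by_length(indexed_words: List[Tuple[int, int]],
                              target_total_length: int) -> float:
    """Ratio of the total indexed-word length to the total target-word length."""
    if target_total_length <= 0:
        raise ValueError("target_total_length must be positive")
    return indexed_length(indexed_words) / target_total_length


# (length, occurrences) rows of Table 5
table_5 = [(4, 10), (4, 5), (2, 3)]
print(indexed_length(table_5))  # 66

# Hypothetical total length of the 500 target words
degree = indexing_degree_by_length(table_5, target_total_length=1320)
```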

It is determined that the method for index degree can also be：Determine the target word of any original document in the conjunction of the second class gathering Include the quantity of word in term set in quantity, and the target word of the original document；According to the mesh of the original document The quantity of word is marked, and includes the quantity of word in term set in the target word of the original document, determines the second class The index degree of the original document during gathering is closed.

For example, k target words are extracted from a document in total, of which r words are contained in the term set of the domain to which the document corresponds; the r words include repeated words. The indexing degree of the document is:

$$\text{indexing degree} = \frac{r}{k}$$

Continuing the example of Table 5, with 500 target words of which 10+5+3 are contained in the term set, the indexing degree of the document is

$$\frac{10 + 5 + 3}{500} = 0.036$$
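The count-based indexing degree can be sketched as follows. The function name is hypothetical, and the placeholder word "other" merely stands in for the 482 target words of the example that are not in the term set.

```python
from typing import Iterable, List


def indexing_degree_by_count(target_words: List[str], term_set: Iterable[str]) -> float:
    """Share of target words (repeats included) that are contained in the term set."""
    if not target_words:
        return 0.0
    terms = set(term_set)
    hits = sum(1 for word in target_words if word in terms)
    return hits / len(target_words)


# Occurrence counts taken from Table 5; "other" is a placeholder for the remaining words
target_words = (["machine learning"] * 10 + ["neural network"] * 5
                + ["clustering"] * 3 + ["other"] * 482)
degree = indexing_degree_by_count(
    target_words, {"machine learning", "neural network", "clustering"})
# degree is 18 / 500
```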

After the indexing degree of each original document in the second cluster set is determined, whether each original document in the second cluster set meets the standard is determined according to the determined indexing degree. Specifically, a preset standard value may be set, and whether each original document meets the standard is determined by comparing its indexing degree with the preset standard value.
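The comparison with the preset standard value can be sketched as follows. The document identifiers, the standard value of 0.03 and the "greater than or equal" comparison are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_standard(degrees: Dict[str, float],
                      standard: float) -> Tuple[List[str], List[str]]:
    """Split document ids into those that meet the standard and those that do not."""
    qualified = [doc for doc, degree in degrees.items() if degree >= standard]
    unqualified = [doc for doc, degree in degrees.items() if degree < standard]
    return qualified, unqualified


degrees = {"doc-1": 0.036, "doc-2": 0.004, "doc-3": 0.12}
qualified, unqualified = split_by_standard(degrees, standard=0.03)
```

The qualified documents are the ones used to update the preset word relation list; the unqualified ones are kept separately so that they can be processed further.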

After the determination result is obtained, the preset word relation list corresponding to the target domain is updated according to the original documents in the second cluster set that meet the standard (S406). This method of updating the preset word relation list is similar to the method in Embodiment 1 of updating the preset word relation list according to the second cluster set (S104), and is not repeated here.

S403 screens the first cluster set and removes the document sets that are not associated or only weakly associated with the target domain, thereby generating the second cluster set. S405 screens each original document in the second cluster set, which further increases the degree of association between the second cluster set and the target domain and thus improves the accuracy of the updated preset word list and word relation list.

In practice, a single update of the knowledge organization system may need to update multiple target domains at the same time. In a preferred implementation, the original documents in the second cluster set that do not meet the standard are clustered again, and the clustered original documents are promptly sent to the databases of the other corresponding target domains, where they are used to update the preset word list or word relation list of those target domains.

In another implementation, S402 to S406 are repeated for the original documents in the second cluster set that do not meet the standard, so that these documents are used again to update the preset word list and the word relation list corresponding to the target domain. When a knowledge organization system is actually constructed, S401 to S406 need to be performed repeatedly, and the word list and the word relation list in the knowledge organization system are updated frequently. As a result, the clustering method, the screening method and the generated term set of the target domain change from one run to the next, so the result of whether each original document in the second cluster set meets the standard may also differ. Therefore, in this embodiment, repeating S402 to S406 for the original documents in the second cluster set that do not meet the standard yields a more accurate judgment of whether those documents meet the standard; updating the preset word list and the word relation list according to that judgment makes full use of the resources.

With the embodiment of the present invention, after multiple original documents related to the target domain are obtained, the documents are automatically clustered and screened to obtain the clusters associated with the target domain; each original document in the associated clusters is then screened; finally, the preset word list and the word relation list of the target domain are automatically updated according to the screened clusters. Compared with prior-art construction methods of a knowledge organization system, the embodiment of the present invention does not need a large amount of manpower to classify and screen the huge number of original documents in the target domain and to update the preset word list and the word relation list of the target domain, which greatly saves resources and improves the efficiency of constructing the knowledge organization system. Moreover, screening the obtained clusters associated with the target domain multiple times ensures that the cluster set used to update the preset word list and the word relation list is strongly associated with the target domain, which improves the accuracy of the updated preset word list and word relation list. In addition, errors caused by human factors are reduced, so that the constructed knowledge organization system is more accurate.

Embodiment 3

Based on the same inventive concept, the embodiment of the present invention provides a construction device of a knowledge organization system. A schematic structural diagram of the device is shown in Fig. 5, and the device specifically includes the following units:

an acquiring unit 501, a clustering unit 502, a screening unit 503 and an updating unit 504, wherein:

the acquiring unit 501 is configured to obtain multiple original documents related to the target domain;

the clustering unit 502 is configured to cluster the multiple original documents to generate a first cluster set;

the screening unit 503 is configured to screen each cluster in the first cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate a second cluster set comprising the clusters associated with the target domain;

the updating unit 504 is configured to update the preset word list and the word relation list according to the second cluster set.

The specific workflow of this device embodiment is as follows. First, the acquiring unit 501 obtains multiple original documents related to the target domain. Second, the clustering unit 502 clusters the multiple original documents to generate the first cluster set. Next, the screening unit 503 screens each cluster in the first cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate the second cluster set comprising the clusters associated with the target domain. Finally, the updating unit 504 updates the preset word list and the word relation list according to the second cluster set.
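The cooperation of the four units can be sketched as a simple pipeline. The class name, the callable signatures and the toy stand-ins for each unit are hypothetical; the sketch only shows the order in which the units hand over their results.

```python
from dataclasses import dataclass
from typing import Callable, List

Cluster = List[str]


@dataclass
class ConstructionDevice:
    acquire: Callable[[str], List[str]]               # role of the acquiring unit 501
    cluster: Callable[[List[str]], List[Cluster]]     # role of the clustering unit 502
    screen: Callable[[List[Cluster]], List[Cluster]]  # role of the screening unit 503
    update: Callable[[List[Cluster]], None]           # role of the updating unit 504

    def run(self, domain: str) -> List[Cluster]:
        documents = self.acquire(domain)
        first_set = self.cluster(documents)
        second_set = self.screen(first_set)
        self.update(second_set)
        return second_set


# Toy stand-ins for the four units
updated: List[Cluster] = []
device = ConstructionDevice(
    acquire=lambda domain: ["doc about servers", "doc about cooking"],
    cluster=lambda docs: [[d] for d in docs],
    screen=lambda clusters: [c for c in clusters if "server" in c[0]],
    update=updated.extend,
)
second_set = device.run("computer")
```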

The beneficial effects obtained by applying this device embodiment are as follows:

In the embodiment of the present invention, after multiple original documents related to the target domain are obtained, the documents are automatically clustered and screened to obtain the clusters related to the target domain, and the preset word list and the word relation list of the target domain are automatically updated using the screened clusters. Compared with prior-art construction methods of a knowledge organization system, the embodiment of the present invention does not need a large amount of manpower to classify and screen the huge number of original documents in the target domain and to update the preset word list and the word relation list of the target domain, which greatly saves resources and improves the efficiency of constructing the knowledge organization system; it also reduces errors caused by human factors, so that the constructed knowledge organization system is more accurate.

There are many implementations by which this device embodiment constructs the knowledge organization system. For example, in a first implementation, the screening unit 503 is specifically configured to:

determine, according to the preset word list and the word relation list, the number of times the words included in the preset word list and the word relation list occur in each cluster of the first cluster set; and screen out, according to the determined number of times, the clusters associated with the target domain from the first cluster set to generate the second cluster set.
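The occurrence-count screening performed by the screening unit can be sketched as follows. The sketch assumes tokenized documents; the function names, the minimum count and the toy clusters are hypothetical.

```python
from typing import Dict, List, Set


def occurrence_counts(clusters: Dict[str, List[List[str]]],
                      vocabulary: Set[str]) -> Dict[str, int]:
    """Count, per cluster, how often words of the vocabulary occur in its documents."""
    return {
        name: sum(1 for doc in docs for token in doc if token in vocabulary)
        for name, docs in clusters.items()
    }


def screen_by_count(clusters: Dict[str, List[List[str]]],
                    vocabulary: Set[str], min_count: int) -> List[str]:
    """Keep the clusters in which vocabulary words occur at least `min_count` times."""
    counts = occurrence_counts(clusters, vocabulary)
    return [name for name, count in counts.items() if count >= min_count]


# Words taken from the preset word list and the word relation list (toy values)
vocabulary = {"computer", "server", "memory"}
clusters = {
    "A": [["computer", "server", "memory"], ["computer", "keyboard"]],
    "B": [["recipe", "oven"]],
}
counts = occurrence_counts(clusters, vocabulary)
second_set = screen_by_count(clusters, vocabulary, min_count=2)
```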

In a second implementation, the updating unit 504 is specifically configured to:

In a third implementation, the updating unit 504 is specifically configured to:

In a fourth implementation, the construction device further includes a generation unit, the generation unit being configured to:

wherein the updating unit 504 is specifically configured to:

update the preset word list according to the term set of the target domain.

In a fifth implementation, the construction device further includes a determining unit, the determining unit being configured to:

wherein the updating unit 504 is specifically configured to:

In a sixth implementation, the construction device further includes an extraction unit, the extraction unit being configured to:

wherein the determining unit is specifically configured to:

In a seventh implementation, the determining unit is specifically configured to:

Embodiment 4

Based on the same inventive concept, the embodiment of the present invention also provides a server. A schematic structural diagram of the server is shown in Fig. 6. The server includes a memory 601 and a processor 602; the memory 601 is used to store information including program instructions, the processor 602 is used to control the execution of the program instructions, and the program, when executed by the processor 602, implements the steps of any construction method of a knowledge organization system provided by the embodiments of the present invention.

Specifically, at least one program stored in the memory 601, when executed by the processor 602, is used to implement the following steps:

obtaining multiple original documents related to the target domain;

Preferably, the at least one program is used to implement:

determining the word relation of any two target words according to the predetermined similarity interval into which their similarity falls, and updating the word relation list according to the word relation.

Preferably, the at least one program is used to implement:

updating the preset word list according to the term set of the target domain.

Preferably, the at least one program is used to implement:

The beneficial effects obtained by applying the server provided by the embodiment of the present invention are the same as or similar to those obtained by the foregoing method embodiments or device embodiment, and are not repeated here.

Those skilled in the art will appreciate that the present invention includes devices for performing one or more of the operations described in this application. These devices may be specially designed and manufactured for the required purposes, or may comprise known devices in general-purpose computers. These devices have computer programs stored therein that are selectively activated or reconfigured. Such computer programs may be stored in a device-readable (for example, computer-readable) medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus, the computer-readable medium including but not limited to any type of disk (including floppy disks, hard disks, optical discs, CD-ROMs and magneto-optical disks), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards or optical cards. That is, a readable medium includes any medium that stores or transmits information in a form readable by a device (for example, a computer).

Those skilled in the art will appreciate that each block in these structural diagrams and/or block diagrams and/or flow diagrams, and combinations of blocks in these structural diagrams and/or block diagrams and/or flow diagrams, can be implemented by computer program instructions. Those skilled in the art will also appreciate that these computer program instructions may be supplied to the processor of a general-purpose computer, a special-purpose computer or other programmable data processing apparatus, so that the scheme specified in the block or blocks of the structural diagrams and/or block diagrams and/or flow diagrams disclosed by the present invention is executed by the processor of the computer or other programmable data processing apparatus.

Those skilled in the art will appreciate that the steps, measures and schemes in the various operations, methods and flows discussed in the present invention may be alternated, changed, combined or deleted. Further, other steps, measures and schemes in the various operations, methods and flows discussed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted. Further, steps, measures and schemes in the prior art that correspond to those in the various operations, methods and flows disclosed in the present invention may also be alternated, changed, rearranged, decomposed, combined or deleted.

The above describes only some embodiments of the present invention. It should be noted that those of ordinary skill in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A construction method of a knowledge organization system, characterized by comprising:

obtaining multiple original documents related to a target domain;

clustering the multiple original documents to generate a first cluster set;

screening each cluster in the first cluster set according to a preset word list and a word relation list corresponding to the target domain, to generate a second cluster set comprising the clusters associated with the target domain; and

updating the preset word list and the word relation list according to the second cluster set.
2. The construction method according to claim 1, characterized in that the step of screening each cluster in the first cluster set according to the preset word list and the word relation list corresponding to the target domain, to generate the second cluster set comprising the clusters associated with the target domain, comprises:

determining, according to the preset word list and the word relation list, the number of times the words included in the preset word list and the word relation list occur in each cluster of the first cluster set; and

screening out, according to the determined number of times, the clusters associated with the target domain from the first cluster set to generate the second cluster set.
3. The construction method according to claim 1, characterized in that the step of updating the preset word list according to the second cluster set comprises:

determining the similarity between the words included in the preset word list and the words included in each cluster of the second cluster set; and

updating the preset word list according to the words in the second cluster set whose determined similarity is below a predetermined threshold.
4. The construction method according to claim 1, characterized in that the step of updating the preset word relation list according to the second cluster set comprises:

extracting target words from each original document in the second cluster set;

determining the relation vector of any target word according to the context information of that target word;

calculating the similarity of any two target words in the second cluster set according to the determined relation vectors; and

determining the word relation of any two target words according to the predetermined similarity interval into which their similarity falls, and updating the preset word relation list according to the word relation.
5. The construction method according to any one of claims 1 to 2, characterized in that, after the step of generating the second cluster set comprising the clusters associated with the target domain, the method further comprises:

extracting the words that satisfy a preset part of speech and/or word length from each original document in the second cluster set; and

removing duplicates from the words that satisfy the preset part of speech and/or word length, and generating the term set of the target domain from the de-duplicated words;

wherein updating the preset word list according to the second cluster set comprises:

updating the preset word list according to the term set of the target domain.
6. The construction method according to claim 5, characterized by further comprising:

determining the indexing degree of each original document in the second cluster set according to the term set of the target domain; and

determining, according to the determined indexing degree, whether each original document in the second cluster set meets the standard;

wherein updating the preset word relation list according to the second cluster set comprises:

updating the preset word relation list corresponding to the target domain according to the original documents in the second cluster set that meet the standard.
7. The construction method according to claim 5, characterized in that, before the step of determining the indexing degree of each original document in the second cluster set according to the term set, the method further comprises:

extracting the target words included in each original document in the second cluster set;

wherein the step of determining the indexing degree of each original document in the second cluster set according to the term set comprises:

indexing each original document in the second cluster set according to the words contained in the term set; and

determining the indexing degree of any original document in the second cluster set according to the lengths of the target words included in that original document and the lengths of the indexed words in that original document.
8. The construction method according to claim 7, characterized in that the step of determining the indexing degree of each original document in the second cluster set according to the term set comprises:

determining the number of target words of any original document in the second cluster set, and the number of the target words of that original document that are contained in the term set; and

determining the indexing degree of that original document in the second cluster set according to the number of target words of the original document and the number of its target words that are contained in the term set.
9. A construction device of a knowledge organization system, characterized by comprising:

an acquiring unit, a clustering unit, a screening unit and an updating unit, wherein:

the acquiring unit is configured to obtain multiple original documents related to a target domain;

the clustering unit is configured to cluster the multiple original documents to generate a first cluster set;

the screening unit is configured to screen each cluster in the first cluster set according to a preset word list and a word relation list corresponding to the target domain, to generate a second cluster set comprising the clusters associated with the target domain;

the updating unit is configured to update the preset word list and the word relation list according to the second cluster set.
10. The construction device according to claim 9, characterized in that the screening unit is specifically configured to:

Determine, according to the preset word list and word relation list, the number of times that the words included in the preset word list and word relation list occur in each class cluster of the first class cluster set; and, according to the determined number of times, screen out from the first class cluster set the class clusters associated with the target domain to generate the second class cluster set.
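The screening step above can be illustrated with a short sketch. It assumes that documents are already tokenized and that a cluster is kept when the domain words occur at least `min_hits` times in it; the threshold rule, the parameter `min_hits` and the function name `screen_clusters` are assumptions, not taken from the claim.

```python
from collections import Counter


def screen_clusters(clusters, domain_words, min_hits=1):
    """clusters maps a cluster id to a list of tokenized documents.
    Returns only the clusters in which domain words occur often enough."""
    kept = {}
    for cluster_id, docs in clusters.items():
        # Count every token across all documents of this cluster.
        counts = Counter(tok for doc in docs for tok in doc)
        hits = sum(counts[w] for w in domain_words)
        if hits >= min_hits:  # assumed decision rule
            kept[cluster_id] = docs
    return kept


first_set = {
    "a": [["thesaurus", "indexing"], ["thesaurus"]],
    "b": [["football", "score"]],
}
second_set = screen_clusters(first_set, {"thesaurus", "indexing"}, min_hits=2)
```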
11. The construction device according to claim 9, characterized in that the updating unit is specifically configured to:

Determine the similarity between the words included in the preset word list and the words included in each class cluster of the second class cluster set;

Update the preset word list according to the words in the second class cluster set whose determined similarity is lower than a predetermined threshold.
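One possible reading of this update is sketched below: a cluster word whose best similarity to every word already in the list falls below the threshold is treated as new and appended. The claim does not name a similarity measure, so the character-level ratio from Python's `difflib.SequenceMatcher` is only a stand-in, and `update_word_list` and its default threshold are hypothetical.

```python
from difflib import SequenceMatcher


def update_word_list(preset_words, cluster_words, threshold=0.8):
    """Append cluster words that are dissimilar to all words in the list.
    Similarity here is a stand-in string ratio, not the patent's measure."""
    updated = list(preset_words)
    for word in cluster_words:
        best = max(
            (SequenceMatcher(None, word, known).ratio() for known in updated),
            default=0.0,
        )
        if best < threshold:  # low similarity: treat as a new word
            updated.append(word)
    return updated


new_list = update_word_list(["thesaurus"], ["thesaurus", "ontology"])
```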
12. The construction device according to claim 9, characterized in that the updating unit is specifically configured to:

Extract target words from each original document in the second class cluster set;

Determine the relation vector of any target word according to the contextual information of that target word;

Calculate, according to the determined relation vectors, the similarity of any two target words in the second class cluster set;

Determine the word relation of any two target words according to the relationship between their similarity and predetermined similarity intervals, and update the preset word relation list according to the word relation.
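The steps above can be sketched as follows. The sketch assumes that a relation vector is a bag of co-occurring context words within a fixed window, that similarity is cosine similarity, and that each similarity interval maps to a relation label; the window size, the interval bounds and the labels "equivalent" and "related" are illustrative assumptions only.

```python
import math
from collections import Counter


def relation_vector(word, sentences, window=2):
    """Count the words that appear within `window` positions of `word`."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for j, t in enumerate(tokens[lo:hi], start=lo)
                           if j != i)
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def word_relation(similarity, intervals):
    """Return the label of the first interval [low, high) containing similarity."""
    for low, high, label in intervals:
        if low <= similarity < high:
            return label
    return None


# Hypothetical intervals; the patent's actual bounds are not given here.
INTERVALS = [(0.8, 1.01, "equivalent"), (0.4, 0.8, "related")]
sentences = [["a", "x", "b"], ["a", "y", "b"]]
sim = cosine(relation_vector("x", sentences), relation_vector("y", sentences))
relation = word_relation(sim, INTERVALS)
```

Words that share the same contexts receive a high similarity and fall into the top interval; pairs that fall into no interval yield `None` and would leave the word relation list unchanged.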
13. The construction device according to any one of claims 9-10, characterized by further comprising a generating unit, the generating unit being configured to:

After the step of generating the second class cluster set comprising the class clusters associated with the target domain, extract from each original document in the second class cluster set the words that satisfy a preset part of speech and/or word length; deduplicate the words that satisfy the preset part of speech and/or word length, and generate the term set of the target domain from the deduplicated words;

Wherein the updating unit is specifically configured to:

Update the preset word list according to the term set of the target domain.
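A minimal sketch of term set generation follows. It assumes the documents have already been segmented and part-of-speech tagged into `(word, pos)` pairs by some external tool, and it applies both the part-of-speech filter and the length filter together, which is only one of the "and/or" combinations the claim allows. The tag `"n"`, the length bounds and the name `build_term_set` are hypothetical.

```python
def build_term_set(tagged_docs, allowed_pos=frozenset({"n"}),
                   min_len=2, max_len=8):
    """Keep words with an allowed part of speech and an allowed length,
    removing duplicates while preserving first-seen order."""
    seen = set()
    terms = []
    for doc in tagged_docs:
        for word, pos in doc:
            if (pos in allowed_pos
                    and min_len <= len(word) <= max_len
                    and word not in seen):
                seen.add(word)
                terms.append(word)
    return terms


docs = [
    [("index", "n"), ("run", "v"), ("index", "n")],
    [("term", "n"), ("a", "n")],
]
term_set = build_term_set(docs)
```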
14. The construction device according to claim 13, characterized by further comprising a determining unit, the determining unit being configured to:

Determine, according to the term set of the target domain, the index degree of each original document in the second class cluster set;

Determine, according to the determined index degree, whether each original document in the second class cluster set is up to standard;

Wherein the updating unit is specifically configured to:

Update the preset word relation list corresponding to the target domain according to the original documents in the second class cluster set that are up to standard.
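The up-to-standard check can be pictured as a simple threshold on the index degree. The cut-off value and the name `documents_up_to_standard` are assumptions; the claim only states that the decision is made from the determined index degree.

```python
def documents_up_to_standard(index_degrees, min_degree=0.5):
    """index_degrees maps a document id to its index degree.
    Returns the ids whose degree reaches the assumed cut-off."""
    return [doc_id for doc_id, degree in index_degrees.items()
            if degree >= min_degree]


qualified = documents_up_to_standard({"d1": 0.8, "d2": 0.3})
```

Only the documents returned here would then be used to update the word relation list.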
15. The construction device according to claim 13, characterized by further comprising an extraction unit, the extraction unit being configured to:

Extract the target words that each original document in the second class cluster set includes;

Wherein the determining unit is specifically configured to:

Index each original document in the second class cluster set according to the words contained in the term set;

Determine the index degree of any original document in the second class cluster set according to the length of the target words that the original document includes and the length of the indexed words in that original document.
16. The construction device according to claim 15, characterized in that the determining unit is specifically configured to:

Determine, for any original document in the second class cluster set, the number of target words of the original document and the number of those target words that are included in the term set;

Determine the index degree of the original document in the second class cluster set according to the number of target words of the original document and the number of those target words that are included in the term set.
17. A server, comprising a memory and a processor, the memory being configured to store information including program instructions, and the processor being configured to control execution of the program instructions, characterized in that the program instructions, when executed by the processor, implement the steps of the method according to any one of claims 1-8.