CN114328951B

CN114328951B - Knowledge graph construction method integrating information acquisition and triplet extraction

Info

Publication number: CN114328951B
Application number: CN202111538747.9A
Authority: CN
Inventors: 程良伦; 叶海明; 张伟文
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2021-12-15
Filing date: 2021-12-15
Publication date: 2024-08-27
Anticipated expiration: 2041-12-15
Also published as: CN114328951A

Abstract

The invention relates to a knowledge graph construction method integrating information acquisition and triplet extraction, which comprises the following steps: S1: periodically crawling marine-related text content, including news, from specified web pages using crawler technology; S2: performing entity extraction and relation extraction on the text content using a natural language processing tool to obtain news triplets, and storing the news triplets in a database; S3: constructing a knowledge graph from the triplets in the database and visualizing the knowledge graph in a data browser; S4: acquiring the associations between pieces of knowledge from the visualized knowledge graph. In this scheme, information acquisition and triplet extraction are integrated to construct the knowledge graph, and the whole flow is built as an end-to-end task, which reduces the cost of use for the user; by constructing a marine knowledge graph, knowledge is discovered from scattered data, the associations between things are mined, and organizations are helped to make guiding decisions.

Description

Knowledge graph construction method integrating information acquisition and triplet extraction

Technical Field

The invention relates to the field of knowledge graphs, and in particular to a knowledge graph construction method integrating information acquisition and triplet extraction.

Background

With the continued advancement of global integration, better and faster development of the marine industry has become a primary national strategic goal. The marine industry is a broad field that includes marine fishery, the marine economy, the marine military, marine environmental protection, and other sectors. Because these sectors are related to one another yet differ in their characteristics, sharing information among them and using it in a linked manner has become a bottleneck for the development of the marine industry. With the development of technology, the knowledge graph, as a tool that reveals the dynamic development of a knowledge field and provides practical and valuable reference for disciplinary research, has gradually come into view. At present, because the information involved is highly specialized, work on a knowledge base for the marine industry is still essentially nonexistent, so the construction of a large-scale semantic knowledge base for the emerging marine industry has become a necessity.

A knowledge graph is a semantic representation of the real world in which each node represents an entity and the edges connecting the nodes correspond to the relationships between entities. Heterogeneous data is integrated and expressed as knowledge, and that knowledge is the information attached to the nodes or edges. A knowledge graph organizes all entities and their relationships into a directed graph structure, and this representation mirrors the way humans perceive the world. Through data mining, information processing, knowledge measurement and graph drawing, the knowledge graph displays a complex knowledge field and knowledge system, reveals the development dynamics and regularities of the field, provides a comprehensive, holistic and relationship-linked reference for research in the field, and is an important means of production in an intelligent society. Traditional research has focused mainly on the construction of the knowledge graph itself, such as knowledge updating and knowledge fusion. In practice, however, data acquisition and data cleaning must be considered in addition to the final construction of the knowledge graph, because these two steps strongly affect the quality of the final knowledge graph.

The prior art discloses a system and a method for promoting structured knowledge graph construction by utilizing group learning behaviors, wherein the system comprises an information extraction module, an individual knowledge graph construction module, a group knowledge graph construction module, a knowledge graph construction module and a feedback module. That invention is suitable for primary and secondary school teaching and online learning. While learning a knowledge point Ki, a learner operates on content related to Ki and thereby establishes associations between Ki and the topic knowledge point set K = {K1, K2, K3, …, Ki, …, Kt} to which Ki belongs, forming an individual knowledge graph in which the knowledge associations are expressed as triples < acceptance, relationship and property >. The individual knowledge graphs of a plurality of learners are aggregated to form a group knowledge graph Gc. Reliability analysis is performed on all triples in the group knowledge graph, triples that do not reach a reliability threshold are removed, an authoritative knowledge point graph Gk for the knowledge point Ki is obtained, and the resulting knowledge point graph is refined by experts. The final knowledge point graph can provide feedback for a learner's study, but with respect to information acquisition, the information cannot be shared or used in a linked manner.

Disclosure of Invention

The invention provides a knowledge graph construction method integrating information acquisition and triplet extraction, which aims to overcome the technical defects that information cannot be shared and used in a linked manner among existing industries, that knowledge cannot be discovered from scattered data, and that the cost of use for the user is high.

In order to achieve the aim of the invention, the technical scheme adopted is as follows:

A knowledge graph construction method integrating information acquisition and triplet extraction comprises the following steps:

S1: periodically crawling marine related text contents including news from a specified webpage by utilizing a crawler technology;

S2: performing entity extraction and relation extraction on the text content by using a natural language processing tool to obtain a news triplet, and then storing the news triplet into a database;

S3: constructing a knowledge graph according to the triples of news in the database, and realizing the visualization of the knowledge graph in a data browser;

S4: acquiring the associations between pieces of knowledge from the visualized knowledge graph.

In this scheme, information acquisition and triplet extraction are integrated to construct the knowledge graph, and the whole flow is built as an end-to-end task, which reduces the cost of use for the user; by constructing a marine knowledge graph, knowledge is discovered from scattered data, the associations between things are mined, and organizations are helped to make guiding decisions.

Preferably, step S1 comprises the steps of:

S11: setting the target address of the crawler, performing a simple analysis of the HTML of the website, and completing the crawler framework according to a template to obtain news;

In the above scheme, the following fields are parsed and stored as the fields of the database news table: news headline (news_title), news type (news_type), news date (news_date), news summary (news_summary), news body (news_content), news web page (news_web_url), and news source (news_source).

S12: searching the database using the URL of the news as the identifier to determine whether the news is a duplicate;

S13: to simplify user operation, setting a timing function so that the crawler program periodically crawls news from the specified website according to the user's settings.

Preferably, in step S12, when the initial dataset is being constructed and news of a designated type is crawled between its start and end pages, a repeated news item is skipped and crawling continues with the next item; when updated news is being crawled, encountering a repeated news item indicates that the secondary website of that news type has already been crawled, and crawling moves on to the secondary website of the next news type.
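The storage and duplicate check of steps S11 and S12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the use of SQLite, the table name news, the id column, and the helper names open_db, is_duplicate and save_news are all assumptions; only the news_* column names come from the field list above.

```python
import sqlite3

# Column names follow the fields listed for S11. The table name, the id
# column and the choice of SQLite are assumptions made for this sketch.
SCHEMA = """
CREATE TABLE IF NOT EXISTS news (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    news_title TEXT,
    news_type TEXT,
    news_date TEXT,
    news_summary TEXT,
    news_content TEXT,
    news_web_url TEXT UNIQUE,
    news_source TEXT
)
"""


def open_db(path=":memory:"):
    """Open the database and make sure the news table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def is_duplicate(conn, url):
    """S12: look the news item up by its URL, which serves as the identifier."""
    row = conn.execute(
        "SELECT 1 FROM news WHERE news_web_url = ?", (url,)
    ).fetchone()
    return row is not None


def save_news(conn, item):
    """Store one parsed news item unless its URL is already present."""
    if is_duplicate(conn, item["news_web_url"]):
        return False
    conn.execute(
        "INSERT INTO news (news_title, news_type, news_date, news_summary,"
        " news_content, news_web_url, news_source)"
        " VALUES (:news_title, :news_type, :news_date, :news_summary,"
        " :news_content, :news_web_url, :news_source)",
        item,
    )
    conn.commit()
    return True
```

Calling save_news twice with the same item stores it once; the second call returns False because the URL is already in the table.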

Preferably, step S2 comprises the steps of:

S21: constructing a relation triplet data table, and parsing and storing its fields;

In the above scheme, the following fields are parsed and stored as the fields of the database triplet relationship table: index (id), subject (triple_subject), subject tag (triple_subject_label), verb (triple_verb), object (triple_object), object tag (triple_object_label), original text (original_text), the table in which the original text is located (original_text_table), and the index of the original text in that table (original_text_table_id).

S22: when reading the data obtained in the information acquisition stage from the database, first reading the data in the news headline field into a cache, and tagging the parts of speech of the news headline read into the cache;

S23: performing syntactic analysis on the sentences of the news, and maintaining, for each word in a sentence, a dictionary that stores its syntactic dependency child nodes;

S24: performing semantic analysis on the sentences using the results of S21 and S22, and labeling semantic roles;

S25: performing triplet extraction using the results of S21, S22 and S23;

In the above scheme, if the semantic role label is empty, dependency syntax is used to extract predicate-centered fact triplets. Three cases are mainly handled: subject-predicate-object; a postposed attributive with a verb-object relation; and a subject-predicate-complement relation containing a preposition-object relation.

S26: filtering out repeated triplets, removing meaningless punctuation marks using a regular expression, writing the triplets into the triplet table, and storing the news triplets in the database.

Preferably, in step S22, before triplet extraction it is checked whether the data has already been extracted, and if so, it is skipped; if not, word segmentation is performed using a natural language processing platform, word segmentation being the process of recombining a continuous character sequence into a word sequence according to a certain specification;

Part-of-speech tagging is a text data processing technique in corpus linguistics that tags parts-of-speech of words in the corpus according to their meaning and contextual content.

Preferably, in step S23, the sentence is syntactically analyzed, and for each word in the sentence a dictionary storing its syntactic dependency child nodes is maintained;

Syntactic analysis refers to analyzing the grammatical function of the words in a sentence. It is a direct realization of the idea of chunk analysis and simplifies the description of a sentence by identifying high-level structural units. Using the results of S21 and S22, the child nodes of the sentence are found.

Preferably, in step S24, semantic role labeling is a shallow semantic analysis technique that analyzes the predicate-argument structure of a sentence, taking the sentence as the unit, without deeply analyzing the semantic information contained in the sentence; the task of semantic role labeling is to take the predicate of the sentence as the center, study the relationship between each component of the sentence and the predicate, and describe those relationships with semantic roles.

Preferably, step S3 comprises the steps of:

S31: constructing a triplet data table from the triplets in the database, sequentially acquiring triplet relations from the triplet data table, and acquiring the subject, subject label, predicate, object and object label of each triplet relation;

S32: using Cypher, the graph query language of Neo4j, to check whether the subject and the object exist in the knowledge graph; if the same subject exists, querying whether the same subject label exists; if the same object exists, querying whether the same object label exists;

S33: using a Cypher statement to try to insert the subject, subject label, predicate, object and object label into the knowledge graph, and if this is unsuccessful, determining that the relationship does not exist;

S34: adding new subject labels and object labels for the relationship and creating a new relationship; after all triplet relations have been written, the user starts the Neo4j service and enters the visual interface of the knowledge graph.

Preferably, in step S32, if the same subject exists, it is queried whether the same subject label exists: if the same subject label does not exist, a new subject label is added for the subject; if the same subject label exists, the process advances to the next step; if the same subject does not exist, a new node is created in the knowledge graph;

If the same object exists, it is queried whether the same object label exists: if the same object label does not exist, a new object label is added for the object; if the same object label exists, the process advances to step S33. If the same object does not exist, a new node is created in the knowledge graph.
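The node, label and relationship checks of steps S32 to S34 can be illustrated with a small in-memory model. This is only a sketch of the decision logic: the class TinyGraph and its method names are hypothetical stand-ins for the graph database, and no Neo4j call is involved.

```python
class TinyGraph:
    """In-memory stand-in for the graph database (an assumption of this sketch)."""

    def __init__(self):
        self.labels = {}    # node name -> set of labels
        self.edges = set()  # (subject, predicate, object)

    def upsert_node(self, name, label):
        """S32: create the node if it is missing, otherwise add the label if new."""
        if name not in self.labels:
            self.labels[name] = {label}
            return "created"
        if label not in self.labels[name]:
            self.labels[name].add(label)
            return "label added"
        return "unchanged"

    def add_triple(self, subj, subj_label, pred, obj, obj_label):
        """S32 to S34: ensure both nodes exist, then create the relation once."""
        self.upsert_node(subj, subj_label)
        self.upsert_node(obj, obj_label)
        edge = (subj, pred, obj)
        if edge in self.edges:
            return False  # the relationship already exists
        self.edges.add(edge)
        return True
```

Writing the same triplet twice therefore creates the nodes and the relationship only once, while a known node that arrives with a new label simply gains that label.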

Preferably, in step S4, the knowledge in the knowledge graph is displayed as triplet relationships, each comprising a subject, a subject label, a predicate, an object and an object label; a single element is queried through the knowledge base, and the information of other elements associated with it is obtained from that element; if several triplets share overlapping elements, this indicates that a certain association exists among those triplets.

Compared with the prior art, the invention has the beneficial effects that:

In the knowledge graph construction method integrating information acquisition and triplet extraction provided by the invention, information acquisition and triplet extraction are integrated to construct the knowledge graph, and the whole flow is built as an end-to-end task, which reduces the cost of use for the user; by constructing a marine knowledge graph, knowledge is discovered from scattered data, the associations between things are mined, and organizations are helped to make guiding decisions.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a flow chart of a method of information acquisition according to the present invention;

FIG. 3 is a flow chart of a method of triplet extraction according to the present invention;

FIG. 4 is a flow chart of a method of knowledge graph construction of the present invention;

FIG. 5 is a flow chart of information acquisition according to the present invention;

FIG. 6 is a flowchart of triplet extraction in accordance with the present invention;

FIG. 7 is a flow chart of knowledge graph construction in accordance with the present invention;

FIG. 8 shows the result of querying the term "Guangdong" in the knowledge graph of the present invention;

FIG. 9 shows the result of expanding "ocean economic strong province" in the knowledge graph of the present invention;

FIG. 10 shows the result of expanding "China sea police boat" in the knowledge graph of the present invention;

FIG. 11 shows the result of expanding "a 200,000-ton-class British ship" in the knowledge graph of the present invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the present patent;

The invention is further illustrated in the following figures and examples.

Example 1

As shown in FIGS. 1, 2, 3 and 4, a knowledge graph construction method integrating information acquisition and triplet extraction comprises steps S1 to S4 as set out in the Disclosure of Invention above, together with the preferred sub-steps S11 to S13 of step S1, S21 to S26 of step S2 and S31 to S34 of step S3, including the database news table fields and the database triplet relationship table fields listed there.

Example 2

As shown in FIG. 5, crawler technology is used to periodically crawl marine-related text content, including news, from designated web pages, as follows:

First, the target address of the crawler is set, the HTML of the website is briefly analyzed, and the crawler framework is completed according to a template. For subsequent processing, we parse and store the following fields: news headline, news type, news date, news summary, news body, news web page, and news source.

Next, the database is searched using the URL of the news as the identifier to determine whether the news is a duplicate. When building the initial dataset, news of a specified type is crawled between its start and end pages; if a repeated news item is encountered, it is skipped and crawling continues with the next item. Crawling updated news is different: encountering a repeated news item indicates that the secondary website of that news type has already been crawled, so crawling moves on to the secondary website of the next type.
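The difference between the two crawling modes can be sketched as a single loop. The function name crawl_section and its parameters are hypothetical, and the sketch assumes that the URLs of one secondary website are visited newest first, which the text does not state explicitly.

```python
def crawl_section(urls, seen, initial_build):
    """Return the URLs of one news type that still need to be stored.

    urls          -- article URLs of one secondary website, assumed newest first
    seen          -- set of URLs already stored in the database
    initial_build -- True when building the initial dataset, False when updating
    """
    fresh = []
    for url in urls:
        if url in seen:
            if initial_build:
                continue  # initial build: skip the repeated item and keep going
            break         # update run: the rest of this section is already stored
        fresh.append(url)
        seen.add(url)
    return fresh
```

In an update run the loop stops at the first known URL, after which the caller would move on to the secondary website of the next news type.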

Finally, to simplify the user operation, we set a timing function, and the crawler program can periodically crawl news of a specified website according to the user's settings.
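The timing function could look like the following minimal sketch. The name run_periodically, the fixed number of rounds and the injectable sleep parameter are assumptions made so that the example terminates and can be exercised without waiting; a deployed crawler would more likely loop indefinitely or rely on an external scheduler.

```python
import time


def run_periodically(crawl_once, interval_seconds, rounds, sleep=time.sleep):
    """Call crawl_once every interval_seconds, for a fixed number of rounds."""
    results = []
    for i in range(rounds):
        results.append(crawl_once())
        if i < rounds - 1:
            sleep(interval_seconds)  # wait for the user-configured interval
    return results
```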

Example 3

As shown in FIG. 6, entity extraction and relation extraction are performed on the text content using a natural language processing tool to obtain news triplets, which are then stored in a database, as follows:

First, a relation triplet data table is constructed. We parse and store the following fields: index, subject, subject label, verb, object, object label, original text, the table in which the original text is located, and the index of the original text in that table.

Second, the data obtained in the information acquisition stage is read from the database, starting with an attempt to read the news headline (news_title) field into the cache. Before triplet extraction, it is checked whether the data has already been extracted, and if so, it is skipped; if not, word segmentation is performed using the natural language processing platform of Harbin Institute of Technology (pyltp). Word segmentation is the process of recombining a continuous character sequence into a word sequence according to a certain specification. For example, for the news headline "The first seafloor high-speed rail tunnel in China has completed offshore drilling work", the result after word segmentation is: ['domestic', 'first', 'strip', 'seafloor', 'high-speed rail', 'tunnel', 'already', 'complete', 'offshore', 'drilling', 'work'].

The news headline (news_title) that has been read into the cache is then part-of-speech tagged. Part-of-speech tagging is a text data processing technique in corpus linguistics that tags the part of speech of each word in a corpus according to its meaning and context. For the same news headline, "The first seafloor high-speed rail tunnel in China has completed offshore drilling work", part-of-speech tagging applied to the word segmentation result gives: ['nl', 'm', 'q', 'n', 'n', 'n', 'd', 'v', 'nl', 'v', 'v']. Placed in one-to-one correspondence with the word segmentation result, this is: {'domestic': 'nl', 'first': 'm', 'strip': 'q', 'seafloor': 'n', 'high-speed rail': 'n', 'tunnel': 'n', 'already': 'd', 'complete': 'v', 'offshore': 'nl', 'drilling': 'v', 'work': 'v'}.

Then, the sentence is syntactically analyzed, and for each word in the sentence a dictionary storing its syntactic dependency child nodes is maintained. Syntactic analysis refers to analyzing the grammatical function of the words in a sentence. It is a direct realization of the idea of chunk analysis and simplifies the description of a sentence by identifying high-level structural units. With the results of the first two steps, the child nodes of the sentence can be found. Continuing with the same headline, the child nodes obtained are: [{}, {}, {'ATT': [1]}, {}, {'ATT': [3]}, {'ATT': [0, 2, 4]}, {}, {'SBV': [5], 'ADV': [6], 'VOB': [10]}, {}, {'ADV': [8]}, {'ATT': [9]}].
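How such a child-node dictionary can be derived from dependency arcs is sketched below. The arc list ARCS is reconstructed here from the child-node result quoted above and is not taken from the parser's actual output; the function name build_child_dicts, the 0-based head indices, the use of -1 for the root and the root label "HED" are assumptions of this sketch.

```python
def build_child_dicts(arcs):
    """Build one {relation: [child indices]} dictionary per word.

    arcs[i] is (head_index, relation) for word i, where head_index is 0-based
    and -1 marks the root of the sentence.
    """
    children = [{} for _ in arcs]
    for index, (head, relation) in enumerate(arcs):
        if head >= 0:
            children[head].setdefault(relation, []).append(index)
    return children


# Arcs reconstructed from the example headline's child-node result.
ARCS = [
    (5, "ATT"), (2, "ATT"), (5, "ATT"), (4, "ATT"), (5, "ATT"), (7, "SBV"),
    (7, "ADV"), (-1, "HED"), (9, "ADV"), (10, "ATT"), (7, "VOB"),
]
```

Applied to ARCS, build_child_dicts reproduces the child-node list given for the example headline, with the predicate at index 7 holding its SBV, ADV and VOB children.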

Next, semantic analysis is performed on the sentence using the results of the first two steps, and semantic roles are labeled. Semantic role labeling is a shallow semantic analysis technique that analyzes the predicate-argument structure of a sentence, taking the sentence as the unit, without deeply analyzing the semantic information contained in the sentence. Specifically, the task of semantic role labeling is to take the predicate of the sentence as the center, study the relationship between each component of the sentence and the predicate, and describe those relationships with semantic roles. Continuing with the same news headline, the result after semantic role labeling is: {7: {'A0': ['A0', 0, 5], 'ADV': ['ADV', 6], 'A1': ['A1', 8, 10]}}. Mapped back onto the words of the headline, this is: {'complete': {'A0': ['A0', 'first seafloor high-speed rail tunnel in China'], 'ADV': ['ADV', 'already'], 'A1': ['A1', 'offshore drilling work']}}.

Triplet extraction is then performed using the results of the first four steps. If the semantic role label is empty, dependency syntax is used to extract predicate-centered fact triplets. Three cases are mainly handled: subject-predicate-object; a postposed attributive with a verb-object relation; and a subject-predicate-complement relation containing a preposition-object relation. Continuing with the same news headline, the triplet relation obtained from the results of the first four steps is: ['first seafloor high-speed rail tunnel in China', 'complete', 'offshore drilling work'].
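The extraction step can be sketched on the example data as follows. This is an illustrative simplification rather than the patent's code: the function names expand and extract_triples are hypothetical, the roles are stored as (name, start, end) tuples with the ADV role omitted, the English glosses are joined with spaces (the original Chinese words would be concatenated directly), and the dependency fallback covers only the subject-predicate-object case out of the three cases described.

```python
WORDS = ["domestic", "first", "strip", "seafloor", "high-speed rail", "tunnel",
         "already", "complete", "offshore", "drilling", "work"]
CHILDREN = [{}, {}, {"ATT": [1]}, {}, {"ATT": [3]}, {"ATT": [0, 2, 4]}, {},
            {"SBV": [5], "ADV": [6], "VOB": [10]}, {}, {"ADV": [8]}, {"ATT": [9]}]
# Semantic roles of the predicate at index 7; the ADV role is left out here.
ROLES = {7: {"A0": ("A0", 0, 5), "A1": ("A1", 8, 10)}}


def expand(index, words, children):
    """Prefix a word with its attributive (ATT) modifiers, recursively."""
    parts = []
    for child in children[index].get("ATT", []):
        parts.extend(expand(child, words, children))
    parts.append(words[index])
    return parts


def extract_triples(words, children, roles):
    """Return (subject, predicate, object) tuples for one sentence."""
    triples = []
    for verb in range(len(words)):
        args = roles.get(verb, {})
        if "A0" in args and "A1" in args:
            # Semantic roles available: A0 is the agent span, A1 the patient span.
            _, s0, s1 = args["A0"]
            _, o0, o1 = args["A1"]
            subj = " ".join(words[s0:s1 + 1])
            obj = " ".join(words[o0:o1 + 1])
        elif "SBV" in children[verb] and "VOB" in children[verb]:
            # Fallback: subject-predicate-object from the dependency children.
            subj = " ".join(expand(children[verb]["SBV"][0], words, children))
            obj = " ".join(expand(children[verb]["VOB"][0], words, children))
        else:
            continue
        triples.append((subj, words[verb], obj))
    return triples
```

With the semantic roles present, the object span covers words 8 to 10 ("offshore drilling work"). If ROLES is replaced by an empty dictionary, this simplified fallback expands only attributive modifiers and therefore yields the shorter object "drilling work".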

Finally, repeated triplet relations are filtered out, meaningless punctuation marks are removed using a regular expression, and the triplets are then written into the triplet table. If the headline is empty, all of the above steps are repeated on the news summary.
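A minimal sketch of this filtering step is given below. The set of punctuation characters, the decision to drop triplets with an empty element, and the name clean_and_dedupe are assumptions; the text only states that a regular expression removes punctuation and that repeated triplets are filtered out.

```python
import re

# Punctuation to strip; the exact character set is an assumption of this sketch.
PUNCT_CHARS = ",.;:!?\"'()[]，。；：！？、“”‘’（）《》"
PUNCT = re.compile("[" + re.escape(PUNCT_CHARS) + "]")


def clean_and_dedupe(triples):
    """Strip punctuation from every element and keep each triplet once."""
    seen = set()
    result = []
    for triple in triples:
        cleaned = tuple(PUNCT.sub("", part).strip() for part in triple)
        # Triplets with an empty element are dropped (an added assumption).
        if all(cleaned) and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

The cleaned list preserves the order in which triplets were first seen, so it can be written to the triplet table directly.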

Example 4

As shown in fig. 7, a knowledge graph is constructed according to triples in a database, and visualization of the knowledge graph is realized in a data browser, which comprises the following steps:

First, the triplet relations are obtained one by one from the triplet data table, and the subject, subject label, predicate, object and object label of each triplet relation are obtained. For example, if the triplet is ['first seafloor high-speed rail tunnel in China', 'complete', 'offshore drilling work'], then the subject is 'first seafloor high-speed rail tunnel in China', the subject label is 'noun', the predicate is 'complete', the object is 'offshore drilling work', and the object label is 'event'.

Second, Cypher, the query language of Neo4j, is used to check whether the subject and the object are already present in the knowledge graph. For example, using the result of the first step, the subject query statement is: match (n) where n.name = "first seafloor high-speed rail tunnel in China" return n, and the object query statement is: match (n) where n.name = "offshore drilling work" return n.

Again, if the same subject exists, then query whether the same subject tag exists: if the same subject label does not exist, a new subject label is added for the subject; if the same subject label exists, the next step is entered. If the same subject does not exist, a new node is created on the knowledge graph.

Then, if the same object exists, it is queried whether the same object label exists: if the same object label does not exist, a new object label is added for the object; if the same object label exists, the next step is entered. If the same object does not exist, a new node is created in the knowledge graph.

Then, a Cypher statement is used to try to insert the subject, subject label, predicate, object and object label obtained in the first step into the knowledge graph. For example, using the result of the first step, the statement is: match (n:noun)-[:complete]->(m:event) where n.name = "first seafloor high-speed rail tunnel in China" and m.name = "offshore drilling work" return m, and it is then checked whether the statement succeeded.

If the statement is unsuccessful, it can be determined that the relationship does not exist. New subject and object labels are then added for the relationship, and a new relationship is created.

Finally, after all triplets have been written, the user can start the Neo4j service. After entering the visual interface, the user clicks the relevant buttons and types a Cypher statement, and can then carry out knowledge reasoning based on the query.
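The Cypher statements used in these steps could be generated along the following lines. This sketch only builds query strings with parameters and does not talk to a database; the helper names quote_label and statements_for, the use of $-style parameters, the backtick quoting of labels, and the CREATE statement for the new relationship are assumptions added for illustration.

```python
def quote_label(label):
    """Backtick-quote a label or relationship type for use inside Cypher."""
    return "`" + label.replace("`", "``") + "`"


def statements_for(subj, subj_label, pred, obj, obj_label):
    """Return (query, parameters) pairs for one triplet."""
    # Existence check for a node by name, used for both subject and object.
    find = "MATCH (n) WHERE n.name = $name RETURN n"
    # Check whether the labelled relationship is already present.
    check_rel = (
        f"MATCH (n:{quote_label(subj_label)})-[r:{quote_label(pred)}]->"
        f"(m:{quote_label(obj_label)}) "
        "WHERE n.name = $subj AND m.name = $obj RETURN r"
    )
    # Create the relationship when the check returns nothing.
    create_rel = (
        "MATCH (n {name: $subj}), (m {name: $obj}) "
        f"CREATE (n)-[:{quote_label(pred)}]->(m)"
    )
    return [
        (find, {"name": subj}),
        (find, {"name": obj}),
        (check_rel, {"subj": subj, "obj": obj}),
        (create_rel, {"subj": subj, "obj": obj}),
    ]
```

Passing the names as parameters rather than formatting them into the query text avoids quoting problems in node names; labels and relationship types cannot be parameterized in Cypher, which is why they are backtick-quoted instead.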

Example 5

As shown in FIGS. 8, 9, 10 and 11, knowledge in the knowledge base is presented as triplet relationships of the form {subject, subject label, predicate, object, object label}. The subject is the agent of an action, the predicate links the subject to the object, and the object is the recipient of the action expressed by the predicate. A single element can be queried through the knowledge base, and the information of other elements associated with it can then be obtained from that element. If several triplets share coincident elements, a certain association exists among those triplets.

Example 1: all elements associated with the element "Guangdong" are queried. The results show, first, that no arrow points to "Guangdong", which means that its in-degree is 0 and its out-degree equals the number of elements associated with it in the query. "Guangdong" therefore always appears as a subject in the knowledge base, from which we can judge that most of the captured news is written from the perspective of Guangdong. On the basis of FIG. 8, we expand the element "ocean economic strong province".

In FIG. 9, it can be seen that the subjects associated with "ocean economic strong province" also include "Fujian Provincial Party Committee and Provincial Government". Knowledge can thus be obtained from this association: "Guangdong" and "Fujian Provincial Party Committee and Provincial Government" have similar construction goals.

Example 2: all elements associated with the element "deep sea" are queried. The results in FIG. 10 show, first, that there are arrows pointing to "deep sea" as well as arrows pointing from "deep sea" to other elements, which indicates that other elements are related to "deep sea" in multiple ways.

We expand one of the elements, "work", which forms a subject-object relationship with "deep sea". FIGS. 10 and 11 show that subjects such as "China sea police boat" and "a 200,000-ton-class British ship" work in the deep sea, so that starting from the single element "deep sea" we can learn other knowledge associated with it.
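The kind of query described in these examples can be mimicked on a plain list of triplets. The sample data below is hypothetical: the predicates "build" and "develop" and the third triplet are invented for illustration, since the figures are not reproduced here; the function names describe and subjects_of are likewise assumptions.

```python
# Hypothetical sample triplets (subject, predicate, object); only the node
# names "Guangdong", "ocean economic strong province" and the Fujian node
# come from the text, the rest is invented for illustration.
TRIPLES = [
    ("Guangdong", "build", "ocean economic strong province"),
    ("Fujian Provincial Party Committee and Provincial Government", "build",
     "ocean economic strong province"),
    ("Guangdong", "develop", "marine economy"),
]


def describe(triples, element):
    """Collect the edges leaving and entering one element, with its degrees."""
    outgoing = [(p, o) for s, p, o in triples if s == element]
    incoming = [(s, p) for s, p, o in triples if o == element]
    return {
        "in_degree": len(incoming),
        "out_degree": len(outgoing),
        "outgoing": outgoing,
        "incoming": incoming,
    }


def subjects_of(triples, obj):
    """Subjects that share the given object, i.e. triplets with a coincident element."""
    return sorted({s for s, _, o in triples if o == obj})
```

On this sample, "Guangdong" has an in-degree of 0, and subjects_of shows that it shares the object "ocean economic strong province" with the Fujian node, which is the kind of association read off FIG. 9.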

It is to be understood that the above examples of the present invention are provided by way of illustration only and do not limit the embodiments of the present invention. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the invention shall fall within the protection scope of the claims of the invention.

Claims

1. The knowledge graph construction method integrating information acquisition and triplet extraction is characterized by comprising the following steps of:

S4: acquiring knowledge association according to the visualized knowledge graph;

Step S2 comprises the steps of:

S25: performing triplet extraction using the results of S21, S22 and S23;

S26: filtering out repeated triplet relations, removing meaningless punctuation marks using a regular expression, writing the triplets into the triplet table, and storing the news triplets in the database;

Step S3 comprises the steps of:

2. The knowledge graph construction method for fusion information acquisition and triplet extraction according to claim 1, wherein the step S1 comprises the steps of:

3. The knowledge graph construction method for fusion information acquisition and triplet extraction according to claim 2, wherein in step S12, when the initial dataset is being constructed and news of a designated type is crawled between its start and end pages, a repeated news item is skipped and crawling continues with the next item; when updated news is being crawled, encountering a repeated news item indicates that the secondary website of that news type has already been crawled, and crawling moves on to the secondary website of the next news type.

4. A method of knowledge graph construction for fusion information acquisition and triplet extraction as claimed in claim 3, wherein in step S22, before triplet extraction it is checked whether the data has already been extracted, and if so, it is skipped; if not, word segmentation is performed using a natural language processing platform, word segmentation being the process of recombining a continuous character sequence into a word sequence according to a certain specification;

5. The method according to claim 4, wherein in step S23, the sentence is subjected to syntactic analysis, and a dictionary for storing syntactically dependent child nodes is maintained for each word in the sentence;

6. The method for constructing a knowledge graph with fusion information acquisition and triplet extraction as claimed in claim 5, wherein in step S24, semantic role labeling is a shallow semantic analysis technique that analyzes the predicate-argument structure of a sentence, taking the sentence as the unit, without deeply analyzing the semantic information contained in the sentence; the task of semantic role labeling is to take the predicate of the sentence as the center, study the relationship between each component of the sentence and the predicate, and describe those relationships with semantic roles.

7. The method for constructing a knowledge graph with fusion information acquisition and triplet extraction according to claim 6, wherein in step S32, if the same subject exists, it is queried whether the same subject label exists: if the same subject label does not exist, a new subject label is added for the subject; if the same subject label exists, the process advances to the next step; if the same subject does not exist, a new node is created in the knowledge graph;

if the same object exists, it is queried whether the same object label exists: if the same object label does not exist, a new object label is added for the object; if the same object label exists, the process proceeds to step S33; if the same object does not exist, a new node is created in the knowledge graph.

8. The method for constructing a knowledge graph with fusion information acquisition and triplet extraction according to claim 7, wherein in step S4, knowledge in the knowledge graph is displayed in a relation of triples, and the knowledge graph comprises a subject, a subject tag, a predicate, an object and an object tag, wherein a knowledge base is used for independently inquiring a certain element, and other element information associated with the certain element is acquired according to the element, and if overlapping elements exist among a plurality of triples, a certain association exists among the triples.