CN117094390A - Knowledge graph construction and intelligent search method oriented to ocean engineering field - Google Patents
Knowledge graph construction and intelligent search method oriented to ocean engineering field
- Publication number
- CN117094390A (CN202311058328.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- entity
- ocean engineering
- field
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/243—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/082—Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
Abstract
The invention provides a knowledge graph construction and intelligent search method oriented to the field of ocean engineering, comprising the following five steps: (1) constructing an ontology of the ocean engineering field; (2) acquiring ocean engineering field data, comprising structured data, semi-structured data and unstructured data; (3) performing knowledge recognition on the acquired unstructured data; (4) storing the structured data and semi-structured data from step (2) and the unstructured data after knowledge extraction in step (3) in a Neo4j graph database to form a knowledge graph of the ocean engineering field; (5) realizing intelligent search for the search questions of domain users on the basis of the ocean engineering knowledge graph established in step (4). The invention fills a gap in applied research on knowledge graphs in the field of ocean engineering, and designs and implements an ocean engineering intelligent search system based on the knowledge graph. It provides a method and an approach for applying knowledge graphs in the field of ocean engineering.
Description
Technical Field
The invention relates to a knowledge graph construction method, in particular to a construction and intelligent search method of a domain knowledge graph.
Background
In 2012 Google introduced the Google Knowledge Graph, which gave the knowledge graph its name, and used knowledge graph technology to improve the performance of its search engine. With the vigorous development of artificial intelligence, key problems related to knowledge graphs, such as knowledge extraction, representation, fusion, reasoning and question answering, have been solved or advanced to a certain extent, and the knowledge graph has become a new hot spot in the field of knowledge services, attracting wide attention from scholars and industry at home and abroad. A knowledge graph describes the concepts and entities of the objective world and the relationships between them in a structured form, expresses Internet information in a form closer to human cognition, and thereby provides a better ability to organize, manage and understand massive information.
Traditionally, knowledge graphs are divided into general knowledge graphs and domain knowledge graphs. A general knowledge graph emphasizes breadth and aims to cover more entities, relations and triples. A domain knowledge graph, also called an industry knowledge graph, is oriented to a specific industry or subdivided domain and mainly solves professional problems. Its target users include personnel at all levels of the industry, whose operations and business scenarios differ, so the graph requires a certain depth and completeness. An industry knowledge graph has very high accuracy requirements and a strict and rich data schema, and its entities and triples usually carry more attributes and more industry-specific meaning. Besides general applications such as search and question answering, it is used to support complex analysis and decision making within the industry, and therefore places more emphasis on depth, precision and completeness.
Ocean science spans multiple disciplines, ocean technology involves different fields, and ocean engineering is a cross-fusion of science, technology and engineering; this also determines that ocean data is multi-source and heterogeneous. The traditional data search functions of information systems can no longer meet current ocean business requirements, and semantic understanding and knowledge transformation of massive heterogeneous ocean data are difficult to achieve. Knowledge graph technology provides an effective means of using ocean data: by acquiring and fusing rich semantic knowledge from ocean engineering field data, a knowledge graph forms a large, interrelated knowledge network, enabling automatic mining and reasoning of hidden information as well as semantic understanding and fast, accurate search over massive data. Using a knowledge graph for ocean knowledge question answering in the ocean engineering field can effectively popularize ocean knowledge; an intelligent question-answering system oriented to ocean discipline concepts, ocean literature, ocean engineering and ocean natural resource management can effectively promote the development of ocean disciplines and their interdisciplinary fields. Because ocean science covers a wide range, ocean information technology is currently far less developed than disciplines such as physical oceanography, ocean remote sensing and ocean forecasting; knowledge graph technology has not been studied in depth in the ocean field, and existing work is limited to constructing ocean knowledge graphs from ocean-related literature and similar resources.
The application scenarios of knowledge graphs in the ocean domain remain vague, and research on the engineering domain, in which ocean projects develop, use and protect ocean resources, is lacking. At present, natural resource managers, ocean scientific research institutions, ocean-domain demonstration institutions and similar users in the ocean engineering field cannot make deep use of knowledge graphs. Prior studies have only built knowledge graphs around the word "ocean" from ocean literature and open resources; they neither give a clear definition or explanation of knowledge in the ocean engineering field nor use real ocean engineering data.
The problems to be solved in the ocean engineering field are: how to construct a domain ontology when the field lacks a standard definition of knowledge; how to broaden the range of search scenarios when the existing information systems of the field use only structured data and ignore non-electronic resources; and how to build a reasonable knowledge-graph-based ocean engineering intelligent search system so that the relevant personnel obtain efficient knowledge services. Against this research background, the invention aims to explore and construct an ontology of the ocean engineering field covering ocean natural resources, ocean science and ocean encyclopedic knowledge, and, using the general full-process method of knowledge graph construction, to build a domain knowledge graph oriented to the Jiangsu sea area and to provide knowledge services such as entity relationship visualization, search and question answering. The invention can acquire and fuse rich semantic knowledge from massive data, form a large, interrelated knowledge network of the ocean engineering field, realize automatic mining and reasoning of hidden information, and support semantic understanding and fast, accurate search of massive ocean data. The invention is applied to the development, utilization, protection, supervision and monitoring of the Jiangsu sea area, to ocean scientific research, and the like.
Disclosure of Invention
Object of the invention: to overcome the defects of the prior art, the invention provides a knowledge graph construction and intelligent search method oriented to the field of ocean engineering.
Technical solution: to achieve the above object, the invention adopts the following technical solution:
The invention discloses a knowledge graph construction and intelligent search method oriented to the field of ocean engineering, comprising the following steps:
(1) Constructing an ontology of the ocean engineering field;
(2) Using the ocean engineering field ontology obtained in step (1) as the basis for acquiring ocean engineering field data, the acquired data comprising three heterogeneous data types: structured data, semi-structured data and unstructured data;
(3) Performing knowledge identification on the unstructured data obtained in step (2), comprising entity identification, relationship identification and attribute identification;
in the entity identification process, a BERT-BiLSTM-CRF model is adopted as the entity identification model for the ocean engineering field; from bottom to top, the model consists of an input layer, a BERT layer, a BiLSTM layer and a CRF layer; the input layer is the concatenation of the special identifiers [CLS] and [SEP] with the sentence to be identified; the BERT layer represents each input character as a vector; the BiLSTM layer learns the contextual representation of the vectors and extracts text features, wherein the forward LSTM learns the representation of the preceding context from front to back, the backward LSTM learns the representation of the following context from back to front, and the vector representations of the two directions are concatenated and output to the next layer as the final vector; the CRF layer uses the dependencies among labels to add constraints to the output of the BiLSTM layer and converts the feature vectors into the optimal label sequence, yielding the named entities of the ocean engineering field;
in the relationship identification process, position information of the head entity and the tail entity is added on the basis of the R-BERT model to enhance relationship identification performance; in the data preprocessing step, the identifiers "$" and "#" are added on both sides of the head entity e_1 and the tail entity e_2 respectively to mark the position information of the entities, so that each input sequence takes the form shown in the following formula:
$$[\mathrm{CLS}]\; x_1, x_2, \ldots, x_{n-j},\; \$\, e_1\, \$,\; x_{n-j+k}, \ldots, x_{n-i},\; \#\, e_2\, \#,\; x_{n-i+t}, \ldots, x_n\; [\mathrm{SEP}] \tag{1}$$
wherein n represents the length of the input sequence, j and k represent the sequence positions on the two sides of the head entity, and i and t represent the sequence positions on the two sides of the tail entity; after encoding with the BERT model, R-BERT feeds the concatenation of the hidden-layer vector of [CLS] and the averaged hidden-layer vectors of each entity to the softmax layer for relationship classification;
(4) Knowledge graph storage: storing the structured data and semi-structured data from step (2) and the unstructured data after knowledge extraction in step (3) in a Neo4j graph database to form a knowledge graph of the ocean engineering field;
(5) Intelligent search: realizing intelligent search for the search questions of domain users on the basis of the ocean engineering knowledge graph established in step (4); the user inputs a search sentence; the question analysis module extracts the entity s mentioned in the sentence, identifies the relation being searched, and checks whether the search falls within the ocean engineering field; a Cypher query statement is constructed from the identified information, and the result required by the user is finally output;
the question analysis module uses a BERT-CRF model to identify entity names in the question; the searched relation is identified by matching relation words against a dictionary, the relation dictionary comprising relation words between entities and attribute relation words of entities; the final check of the question analysis module judges whether the search sentence belongs to the ocean engineering field: if so, the sentence is passed to the Cypher query statement construction module; otherwise, a prompt is output reminding the user to search within the field; after the entity name and relation word are identified, a Cypher query statement is constructed and executed in Neo4j, and the answer is returned to the user.
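The search step can be illustrated with a minimal Python sketch. It is not the patented implementation: the BERT-CRF entity recognizer is replaced by a plain lookup so the example runs without a model, and the entity names, relation words, relationship types and the `name` node property are all hypothetical. Only the dictionary matching of relation words, the in-domain check and the construction of a Cypher statement follow the description above.

```python
# Hypothetical relation dictionary: relation words between entities and
# attribute relation words, mapped to graph relationship types or properties.
RELATION_WORDS = {
    "located in": ("relation", "LOCATED_IN"),
    "approved by": ("relation", "APPROVED_BY"),
    "area": ("attribute", "area"),
}

# Hypothetical entity names. The patent uses a BERT-CRF model for this step;
# a simple lookup stands in for it here.
KNOWN_ENTITIES = ["Lianyungang Port Project", "Yancheng Wind Farm"]


def analyse_question(question):
    """Return (entity, kind, target), or None if the question is out of domain."""
    lowered = question.lower()
    entity = next((e for e in KNOWN_ENTITIES if e.lower() in lowered), None)
    hits = [w for w in RELATION_WORDS if w in lowered]
    if entity is None or not hits:
        return None  # caller should remind the user to search within the field
    # Prefer the longest matching relation word to avoid partial matches.
    kind, target = RELATION_WORDS[max(hits, key=len)]
    return entity, kind, target


def build_cypher(entity, kind, target):
    """Build a Cypher query string and its parameter dictionary.

    The entity name is passed as a query parameter. Relationship types and
    property keys cannot be parameters in Cypher, so they are interpolated,
    which is acceptable here only because they come from the fixed dictionary.
    """
    if kind == "relation":
        query = f"MATCH (s {{name: $name}})-[:{target}]->(o) RETURN o.name AS answer"
    else:
        query = f"MATCH (s {{name: $name}}) RETURN s.{target} AS answer"
    return query, {"name": entity}
```

The resulting query and parameters would then be handed to a Neo4j driver session; that call is omitted so the sketch stays runnable without a database.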
Further, the methods for constructing the ocean engineering field ontology in step (1) include: the IDEF5 method, the skeleton method, the TOVE method, the METHONTOLOGY method, the SENSUS method, the KACTUS engineering method, and the five-step cycle method.
Further, the structured data in step (2) is obtained as follows: after being exported from the relational database, the data is converted into knowledge triples carrying semantic information, and the data structure of the relational database is directly mapped into an RDF graph; the DM (Direct Mapping) mapping language is used to establish a one-to-one mapping between the header fields of the relational database and the entity attributes of the ocean ontology, thereby realizing knowledge extraction from the structured data.
Further, the semi-structured data in step (2) is obtained by a wrapper-based method, whose specific steps are as follows:
the implementation of the wrapper comprises three steps: crawling the data, parsing the data and storing the data; first, an HTTP request is sent to the server hosting a given URL, and the server returns a Response object containing the content of the web page; the Python libraries used for this include urllib/urllib2, re and requests; a web page parser, either re or BeautifulSoup, then extracts the data from the HTML page; finally the data is stored in the required format; when crawling data, the page structure of each site is first analyzed manually and specific extraction rules are designed for the different pages, after which an automatic extraction method performs batch extraction from pages of the same type; for statically loaded pages, Requests fetches the page source from the initial URL, and the tag content of the page is then parsed with the web page parser to obtain the target data; dynamically loaded pages are handled with Selenium, which obtains the page data as it is updated in real time by simulating user operations, including clicking buttons and entering text.
Further, step (3) also includes a step of preprocessing the sea area use demonstration reports, which are unstructured data in PDF format: first, a document parser written in Python reads the report text from the PDF file, and paragraphs related to business data are selected from the text to build the knowledge extraction corpus; then the entities and relations in the corpus are manually annotated with the character-based BIO labeling method, in which B marks the beginning of an entity, I marks the inside of an entity, and O marks a character that does not belong to any entity; corpus annotation is carried out with the open-source text annotation tool BRAT.
Further, the unstructured data entity identification of step (3) comprises the following methods: rules and dictionary based methods, statistical machine learning based methods, deep learning based methods.
Further, the step (3) of unstructured data entity identification comprises the following specific steps:
a BERT-BiLSTM-CRF model is adopted as the entity recognition model; from bottom to top, the model consists of an input layer, a BERT layer, a BiLSTM layer and a CRF layer; the input layer is the concatenation of the special identifiers [CLS] and [SEP] with the sentence to be identified; the BERT layer represents each input character as a vector; the BiLSTM layer learns the contextual representation of the vectors and extracts text features, wherein the forward LSTM learns the representation of the preceding context from front to back, the backward LSTM learns the representation of the following context from back to front, and the vector representations of the two directions are concatenated and output to the next layer as the final vector; the CRF layer uses the dependencies among labels to add constraints to the output of the BiLSTM layer and converts the feature vectors into the optimal label sequence, yielding the named entities of the ocean engineering field;
since some named entities in the ocean engineering field are long, long sequences are memorized by a long short-term memory network (LSTM); in the LSTM, the states at each time step t are computed as follows:
$$F_t = \sigma(W_f x_t + U_f H_{t-1} + b_f)$$
$$I_t = \sigma(W_i x_t + U_i H_{t-1} + b_i)$$
$$O_t = \sigma(W_o x_t + U_o H_{t-1} + b_o)$$
$$\tilde{C}_t = \tanh(W_c x_t + U_c H_{t-1} + b_c)$$
$$C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t$$
$$H_t = O_t \odot \tanh(C_t)$$
wherein $I_t$, $F_t$ and $O_t$ respectively denote the input gate, forget gate and output gate of the LSTM network; W and U are weight matrices; $b_f$, $b_i$, $b_o$ and $b_c$ are biases; $\sigma$ is the activation function; $\tilde{C}_t$ denotes the candidate memory cell; $C_t$ denotes the memory cell; $H_t$ denotes the hidden state output; $O_t = 1$ represents output and $O_t = 0$ represents reset; $\odot$ denotes the Hadamard product;
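As an illustration of the gate equations, the following is a minimal pure-Python sketch of a single LSTM time step. It assumes the standard LSTM form for the candidate memory cell and the memory cell update (a tanh candidate, and a forget-gated plus input-gated sum), which matches the symbols defined above; the parameter layout `params[gate] = (W, U, b)` and all function names are illustrative choices, not taken from the patent.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def affine(w, u, b, x, h):
    """Compute W x + U h + b element by element."""
    wx, uh = matvec(w, x), matvec(u, h)
    return [p + q + r for p, q, r in zip(wx, uh, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'o', 'c' to (W, U, b)."""
    f_t = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]  # forget gate
    i_t = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]  # input gate
    o_t = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]  # output gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x_t, h_prev)]
    # Memory cell: keep part of the old cell, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Hidden state: output gate applied element-wise (Hadamard product).
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

A BiLSTM runs one such recurrence from left to right and another from right to left over the same sequence, then concatenates the two hidden states at each position.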
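the BiLSTM then concatenates the forward and backward sequence representations of the text as its final vector representation, capturing contextual semantic information through forward and backward feature learning; at time step t, the hidden state $H_t$ and the output $O_t$ of the BiLSTM are expressed as:
$$H_t = [\overrightarrow{H}_t ; \overleftarrow{H}_t]$$
$$O_t = H_t W_{hq} + b_q$$
wherein $\overrightarrow{H}_t$ and $\overleftarrow{H}_t$ are the forward and backward hidden states, $O_t$ here denotes the output of the BiLSTM layer rather than the output gate, and $W_{hq}$ and $b_q$ are the weight matrix and bias of the output layer;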
finally, a CRF discriminative model is adopted to learn the dependencies among labels and to add constraints to the label sequence, obtaining the globally optimal label sequence: assuming that the output of the BiLSTM is the matrix P and the transition matrix between labels is A, for the input sequence $X = (x_1, x_2, \ldots, x_n)$, the score $S(X, Y)$ of the output sequence $Y = (y_1, y_2, \ldots, y_n)$ is expressed as:
$$S(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$
wherein $P_{i, y_i}$ represents the score of label $y_i$ for the i-th character, and $A_{i, j}$ represents the score of label i being followed by label j; after normalization, the probability distribution $P(Y \mid X)$ of the output sequence Y is:
$$P(Y \mid X) = \frac{\exp(S(X, Y))}{\sum_{\tilde{Y} \in Y_X} \exp(S(X, \tilde{Y}))}$$
wherein Y is the real tag sequence, $Y_X$ represents the set of all possible tag sequences, and $\tilde{Y}$ ranges over $Y_X$;
the training objective of the CRF model is to maximize the log-likelihood function, and the objective function L of the model (whose negative serves as the loss to be minimized) is defined as:
$$L = \log P(Y \mid X)$$
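The CRF score, its normalization and the log-likelihood can be checked on a toy example. The sketch below is a hedged illustration in pure Python: it assumes the score is the sum of per-character label scores from P plus label-to-label transition scores from A, without separate start or end tags, and it computes the normalizer by enumerating every tag sequence. Real implementations use the forward algorithm and Viterbi decoding instead; enumeration is used here only because it is easy to verify on short inputs. All function names are illustrative.

```python
import itertools
import math


def sequence_score(emissions, transitions, tags):
    """S(X, Y): emission scores plus transition scores along the tag path."""
    score = sum(emissions[i][t] for i, t in enumerate(tags))
    score += sum(transitions[a][b] for a, b in zip(tags, tags[1:]))
    return score


def log_likelihood(emissions, transitions, tags):
    """log P(Y|X), with the normalizer computed by brute-force enumeration."""
    n_tags = len(transitions)
    all_scores = [
        sequence_score(emissions, transitions, cand)
        for cand in itertools.product(range(n_tags), repeat=len(emissions))
    ]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(all_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return sequence_score(emissions, transitions, tags) - log_z


def best_sequence(emissions, transitions):
    """Highest-scoring tag sequence (exhaustive search, toy sizes only)."""
    n_tags = len(transitions)
    return max(
        itertools.product(range(n_tags), repeat=len(emissions)),
        key=lambda cand: sequence_score(emissions, transitions, cand),
    )
```

With tags indexed as 0 = B, 1 = I, 2 = O, a strongly negative transition score from O to I shows how the CRF layer discourages label sequences that violate the BIO scheme even when the emission scores alone would allow them.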
Beneficial effects: compared with the prior art, the technical solution of the invention has the following advantages:
(1) In the field of ocean knowledge graphs, the invention designs a knowledge graph construction and intelligent retrieval method oriented to the field of ocean engineering: by organizing the different types of concepts and data of the field and explicitly defining its ontology, a new top-down method for constructing an ocean engineering knowledge graph is provided;
(2) The invention designs and realizes an intelligent searching system based on a knowledge graph in the field of ocean engineering, and the system can effectively return accurate answers to the searching of users.
Drawings
FIG. 1 is a general framework diagram of the invention;
FIG. 2 is a flow chart of ontology construction in the field of ocean engineering;
FIG. 3 is a schematic diagram of annotating unstructured data with the text annotation tool BRAT;
FIG. 4 shows the structure of the entity extraction model based on BERT-BiLSTM-CRF.
Detailed Description
The following describes the technical scheme of the invention in detail.
The following is only one embodiment of the invention. The invention may have various other embodiments, and those skilled in the art can make corresponding changes and modifications according to the invention without departing from its spirit and essence; such changes and modifications shall fall within the scope of the appended claims.
The invention discloses a knowledge graph construction and intelligent search method oriented to the field of ocean engineering, which comprises the following steps:
(1) Ontology construction in ocean engineering field
An ontology is a representation that describes the concepts within a domain, the relationships between them and their semantics, and that standardizes the representation of knowledge within the domain. Although fully automated ontology construction is efficient, it lacks expert supervision and is prone to errors. In the invention, part of the concepts and relations of the ocean engineering field are extracted manually by experts during ontology construction, and the rest are extracted automatically from the existing ocean information database. Methods for constructing the ocean engineering field ontology include: the IDEF5 method, the skeleton method, the TOVE method, the METHONTOLOGY method, the SENSUS method, the KACTUS engineering method, and the five-step cycle method.
In this embodiment, a seven-step method is used to convert the table structure of the ocean information system database into classes, attributes and relationships. The seven steps for constructing the ocean engineering field ontology are: (1)-(1) determining the scope of the ocean engineering field ontology; (1)-(2) considering the reuse of existing ontologies; (1)-(3) analyzing and listing the important terms of the ontology; (1)-(4) defining the classes and the class hierarchy of the ontology; (1)-(5) defining the attributes of the classes; (1)-(6) defining the constraint relationships; (1)-(7) creating the ontology.
(2) Ocean engineering field data acquisition and processing
The ocean engineering field ontology from step (1) serves as the basis for acquiring ocean engineering field data. The data of the invention comprise three heterogeneous types, namely structured, semi-structured and unstructured data, and data of different structures require different processing.
The structured data comes from ocean engineering business data managed and stored in a relational database. Because clear entity names and relations exist among the data, the structured data of the relational database can be converted into knowledge directly; once exported from the relational database, its fields and relations already contain the entities, attributes and relations, so step (3) is not needed and the processed data is stored directly in Neo4j through step (4). In this embodiment, the structured data is exported from the relational database by combining manual screening with the export functions of the business system; duplicate records are deleted directly, and missing fields are clarified and standardized by domain experts with reference to the related reports. The structured data is processed as follows: the relational database is converted into knowledge triples carrying semantic information, a process called RDB2RDF (Relational Database to RDF); the data structure of the relational database is mapped directly into an RDF graph; and the DM (Direct Mapping) mapping language is used to establish a one-to-one mapping between the header fields of the relational database and the entity attributes of the ocean ontology, thereby realizing knowledge identification for the structured data.
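The idea of mapping table rows directly to triples can be sketched as follows. This is a simplified illustration rather than the W3C Direct Mapping or the patent's implementation: it uses an in-memory SQLite connection in place of the business database, and the table name, column names and ontology property names in the usage example are hypothetical. Each row becomes a subject identified by table name and key, and each mapped header field becomes one (subject, property, value) triple.

```python
import sqlite3


def rows_to_triples(conn, table, key_column, field_to_property):
    """Map each row of `table` to (subject, property, value) triples.

    `field_to_property` is the one-to-one mapping from header fields of the
    relational table to attribute names in the ontology. Table and column
    names are interpolated into the SQL text, so they must come from trusted
    configuration, never from user input.
    """
    columns = ", ".join([key_column, *field_to_property])
    sql = f"SELECT {columns} FROM {table} ORDER BY {key_column}"
    triples = []
    for row in conn.execute(sql):
        subject = f"{table}/{row[0]}"
        for field, value in zip(field_to_property, row[1:]):
            if value is not None:  # a missing field produces no triple
                triples.append((subject, field_to_property[field], value))
    return triples


# Usage with a hypothetical table of sea-use projects.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sea_use_project (id INTEGER, name TEXT, area_ha REAL)")
conn.executemany(
    "INSERT INTO sea_use_project VALUES (?, ?, ?)",
    [(1, "Project A", 12.5), (2, "Project B", None)],
)
mapping = {"name": "projectName", "area_ha": "seaAreaHectares"}
triples = rows_to_triples(conn, "sea_use_project", "id", mapping)
```

The resulting triples can then be written to the graph database, for example one node per subject with the mapped attributes as node properties.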
The semi-structured data comes from websites such as the National Ocean Science Data Center and the ocean professional knowledge service system; the number of sites is small, the page structure is fixed, and the pages are highly similar to one another. Semi-structured data is acquired with a wrapper-based method, the wrapper being a web crawler; wrappers can be generated by the manual method, by wrapper induction, or by automatic extraction. Given the small number of sites and their fixed, similar page structure, the invention combines the manual method with automatic extraction to extract semi-structured knowledge, and generates the wrapper with crawler and web page parsing techniques. After the data is obtained from the ocean engineering sites, it is preprocessed without going through step (3), and the processed data is stored directly in Neo4j through step (4).
The semi-structured data processing method comprises the following specific steps. The wrapper is implemented in three steps: crawling the data, parsing the data and storing the data. First, an HTTP request is sent to the server hosting a given URL, and the server returns a Response object containing the web page content of the URL; Python libraries for issuing HTTP requests include urllib/urllib2, requests and the like. Then a web page parser such as re or BeautifulSoup is used to extract the data from the HTML page. Finally, the data is saved in the required format. When crawling data, the manual method is used to analyze the web page structure of each site by hand and to design specific extraction rules for the different web pages; the automatic extraction method is then used to extract data in batches from web pages of the same type. The websites used in the invention contain both statically loaded and dynamically loaded web pages. Statically loaded web pages are handled with Requests, which obtains the web page source code from the initial URL; the tag content of the page is then parsed with a web page parser to obtain the target data. Dynamically loaded web pages are handled with Selenium, which updates the web page data in real time by simulating user operations such as clicking a button or entering text.
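The parsing and storing steps of such a wrapper can be sketched as follows. To keep the example runnable offline, the HTML is an embedded, hypothetical page fragment rather than a live response, and the hand-written extraction rule is a regular expression (the `re` parser mentioned above); the table layout and field names are assumptions for illustration only.

```python
import json
import re

# Hypothetical page fragment. In a real wrapper the HTML would come from an
# HTTP response, for example requests.get(url).text for a statically loaded page.
SAMPLE_HTML = """
<table id="datasets">
  <tr><td class="name">Tide gauge records</td><td class="year">2021</td></tr>
  <tr><td class="name">Seabed survey</td><td class="year">2022</td></tr>
</table>
"""

# Hand-designed extraction rule for this page type (the "manual method").
ROW_PATTERN = re.compile(
    r'<td class="name">(?P<name>.*?)</td>\s*<td class="year">(?P<year>\d{4})</td>',
    re.DOTALL,
)


def parse_rows(html):
    """Apply the rule to every matching row (the "automatic extraction" step)."""
    return [
        {"name": match.group("name").strip(), "year": int(match.group("year"))}
        for match in ROW_PATTERN.finditer(html)
    ]


records = parse_rows(SAMPLE_HTML)
# Storing step: serialize the records in the required format (JSON here).
serialized = json.dumps(records, ensure_ascii=False)
print(serialized)
```

The same rule can be reused across pages of the same type, which is what makes the combination of a manual rule and batch extraction practical when site structure is fixed.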
The unstructured data consists of collected text business reports related to the ocean engineering field, such as sea area use demonstration reports, in PDF and Word formats. Sea area use demonstration reports whose unstructured data is in PDF format also require preprocessing: first, a document parser written in Python reads the report text from the PDF file, and paragraphs related to business data are screened from the text to build a knowledge extraction corpus; then the entities and relations in the corpus are manually labeled with a character-based BIO labeling method, in which B indicates that a character is the beginning of an entity, I indicates that a character is inside an entity, and O indicates that a character is not part of any entity; the corpus labeling is completed with the open-source text annotation tool BRAT.
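The character-based BIO scheme can be illustrated with a short Python sketch that converts annotated entity spans into one tag per character. The sample string, the span offsets and the type suffixes (`LOC`, `FAC`) are hypothetical; the description above only specifies the B, I and O markers, and in the actual corpus the characters would be those of the Chinese report text.

```python
def bio_tags(text, entities):
    """Return (character, tag) pairs.

    entities: list of (start, end, entity_type) with end exclusive,
    assumed to be non-overlapping.
    """
    tags = ["O"] * len(text)
    for start, end, entity_type in entities:
        tags[start] = f"B-{entity_type}"      # first character of the entity
        for index in range(start + 1, end):
            tags[index] = f"I-{entity_type}"  # remaining characters of the entity
    return list(zip(text, tags))


# Hypothetical annotation: "Bay" is a location and "dock" is a facility.
labeled = bio_tags("Bay dock", [(0, 3, "LOC"), (4, 8, "FAC")])
for character, tag in labeled:
    print(repr(character), tag)
```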
Knowledge extraction is then carried out using step (3). During extraction, the position information of the head entity and the tail entity is added on the basis of the R-BERT model to improve relation extraction performance. In the data preprocessing step, the identifiers "$" and "#" are added on both sides of the head entity \(e_1\) and the tail entity \(e_2\), respectively, to mark the position of each entity. Each input sequence then takes the form shown in the following formula:
\[ [\mathrm{CLS}]\; x_1, x_2, \ldots, x_{n-j},\ \$\, e_1\, \$,\ x_{n-j+k}, \ldots, x_{n-i},\ \#\, e_2\, \#,\ x_{n-i+t}, \ldots, x_n\; [\mathrm{SEP}] \tag{2} \]
where \(n\) represents the length of the input sequence, \(j\) and \(k\) represent the sequence positions on the two sides of the head entity, and \(i\) and \(t\) represent the sequence positions on the two sides of the tail entity. After encoding by the BERT model, R-BERT concatenates the hidden layer vector of [CLS] with the averaged hidden layer vectors of each entity and feeds the result to the softmax layer for relation classification. Indicating the entity positions by adding identifiers before and after each entity, instead of using a traditional position vector, increases the accuracy of relation extraction.
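The marker insertion of formula (2) can be sketched in a few lines of Python. The function name, the token list and the span indices are hypothetical, and the sketch assumes the head entity precedes the tail entity and that the two spans do not overlap; it covers only the preprocessing, not the R-BERT model.

```python
def mark_entities(tokens, head_span, tail_span):
    """Wrap the head entity in "$" and the tail entity in "#".

    Spans are (start, end) token indices with end exclusive. This sketch
    assumes the head entity comes before the tail entity.
    """
    (head_start, head_end), (tail_start, tail_end) = head_span, tail_span
    if not (0 <= head_start < head_end <= tail_start < tail_end <= len(tokens)):
        raise ValueError("spans must be ordered, non-empty and non-overlapping")
    return (
        ["[CLS]"]
        + tokens[:head_start]
        + ["$"] + tokens[head_start:head_end] + ["$"]
        + tokens[head_end:tail_start]
        + ["#"] + tokens[tail_start:tail_end] + ["#"]
        + tokens[tail_end:]
        + ["[SEP]"]
    )


tokens = ["the", "wharf", "project", "is", "located", "in", "Jiaozhou", "Bay"]
marked = mark_entities(tokens, head_span=(1, 3), tail_span=(6, 8))
print(" ".join(marked))
```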
(3) Knowledge identification
Step (3) processes the unstructured data obtained in step (2). Since a large amount of information in the ocean engineering field exists in PDF and Word text formats rather than in information systems, knowledge identification is mainly directed at unstructured data and includes entity extraction, relation identification and attribute identification. Entity extraction, also known as entity recognition, refers to automatically recognizing proper nouns in a text corpus. Relation extraction discovers the semantic relations between named entities in unstructured data. Attribute identification provides a complete description of an entity; because an attribute of an entity can be regarded as a nominal relation between the entity and the attribute value, the attribute identification task can be converted into a relation identification task. Entity recognition identifies named entities in text and labels them with predefined categories, such as project name, organization name and geographic location. Entity recognition methods can be divided into rule- and dictionary-based methods, statistical machine learning methods and deep learning methods. Rule- and dictionary-based methods extract entities by building a dictionary and entity composition rules; statistical machine learning methods treat named entity recognition as a sequence labeling problem and, based on manually defined features, train a machine learning model on a manually labeled corpus to recognize entities. After entity extraction in the ocean engineering field is completed, individual isolated entities are obtained; relation extraction then extracts the semantic relations among the entities from the text and establishes the associations among them. The invention models relation extraction as a classification task and, on the basis of named entity recognition, predicts the type of the relation between each entity pair. Relations in the ocean engineering field mainly refer to the relations among entities such as sea-use projects, project locations, project sea use and sea area use modes.
In this embodiment, a BERT-BiLSTM-CRF model is adopted as the entity recognition model, which consists, from bottom to top, of an input layer, a BERT layer, a BiLSTM layer and a CRF layer. The input layer is the concatenation of the special identifiers [CLS] and [SEP] with the sentence to be recognized. The BERT layer represents each input character as a vector. The BiLSTM layer learns the contextual representation of the vectors and extracts text features: the forward LSTM learns the representation of the preceding context from front to back, the backward LSTM learns the representation of the following context from back to front, and the vector representations in the two directions are concatenated and output to the next layer as the final vector. The CRF layer uses the dependencies among labels to add constraints to the output of the BiLSTM layer and converts the feature vectors into the optimal label sequence, yielding the named entities in the ocean engineering field.
Named entities in the ocean engineering field are relatively long, and there are dependencies between the characters of an ocean entity and its context. The long short-term memory network (LSTM) is a special recurrent neural network that can memorize long sequences: information from the previous time step can be retained at the current time step, so that the neural network learns contextual feature information. In the LSTM network, the hidden states at each time step \(t\) are computed as follows:
\[ F_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]
\[ I_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]
\[ O_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]
\[ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \]
\[ C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \]
\[ H_t = O_t \odot \tanh(C_t) \]
where \(I_t\), \(F_t\) and \(O_t\) respectively represent the input gate, the forget gate and the output gate of the LSTM network, \(W\) and \(U\) are weight matrices, \(b\) is a bias, \(\sigma\) is the activation function, \(h_{t-1}\) is the hidden state output of the previous time step, \(\tilde{C}_t\) represents the candidate memory cell, \(C_t\) represents the memory cell, \(H_t\) represents the hidden state output (\(O_t = 1\) means output and \(O_t = 0\) means reset), and \(\odot\) is the Hadamard product. An LSTM can only take forward text sequence information into account when modeling a sentence and cannot encode information from back to front, whereas a BiLSTM concatenates the forward sequence representation and the reverse sequence representation of the text as the final vector representation, so that the semantic information of the context is captured through forward and reverse feature learning. The BiLSTM hidden state \(H_t\) and output \(O_t\) at time \(t\) are therefore expressed as:
\[ H_t = [\overrightarrow{H}_t ; \overleftarrow{H}_t] \]
\[ O_t = H_t W_{hq} + b_q \]
where \(\overrightarrow{H}_t\) and \(\overleftarrow{H}_t\) denote the hidden states of the forward and backward LSTM at time \(t\).
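The LSTM recurrence and the bidirectional concatenation can be sketched in pure Python as follows. This is a toy illustration under stated assumptions: the standard candidate-cell and memory-cell updates are used, the parameter layout (`params["f"] = (W, U, b)` and so on) is hypothetical, and all weights are zero so that the result is easy to check by hand. A trained model would use a deep learning framework instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, U, b, x, h):
    """Return W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xj for w, xj in zip(W[row], x))
        + sum(u * hj for u, hj in zip(U[row], h))
        + b[row]
        for row in range(len(b))
    ]


def lstm_step(params, x_t, h_prev, c_prev):
    """One LSTM time step: gates, candidate cell, memory cell, hidden state."""
    f = [sigmoid(z) for z in affine(*params["f"], x_t, h_prev)]  # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x_t, h_prev)]  # input gate
    o = [sigmoid(z) for z in affine(*params["o"], x_t, h_prev)]  # output gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x_t, h_prev)]
    c_t = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h_t = [ok * math.tanh(ck) for ok, ck in zip(o, c_t)]
    return h_t, c_t


def run_lstm(params, inputs):
    """Run the LSTM over a sequence, starting from zero states."""
    size = len(params["f"][2])
    h, c = [0.0] * size, [0.0] * size
    outputs = []
    for x_t in inputs:
        h, c = lstm_step(params, x_t, h, c)
        outputs.append(h)
    return outputs


def bilstm(forward_params, backward_params, inputs):
    """Concatenate forward states with backward states at every position."""
    forward = run_lstm(forward_params, inputs)
    backward = run_lstm(backward_params, inputs[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


# Toy parameters: input size 1, hidden size 2, all weights and biases zero.
ZERO = ([[0.0], [0.0]], [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
PARAMS = {"f": ZERO, "i": ZERO, "o": ZERO, "c": ZERO}

h_t, c_t = lstm_step(PARAMS, [1.0], [0.0, 0.0], [1.0, -1.0])
states = bilstm(PARAMS, PARAMS, [[1.0], [2.0], [3.0]])
print(h_t, c_t, len(states), len(states[0]))
```

With zero weights every gate equals 0.5 and the candidate cell is 0, so the memory cell is simply halved, which makes the update rule easy to verify.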
The CRF is a discriminative model widely used in tasks such as word segmentation, part-of-speech tagging and named entity recognition. In BIO labeling, the labels B and I appear in order, and an I label can only appear after a B label. The output of the BiLSTM can be normalized to predict the label of each character independently, but this considers only locally optimal solutions, so the label sequence may contain illegal labels. The CRF model learns the dependencies among labels and adds constraints to the label sequence, which yields the globally optimal label sequence and prevents illegal labels from being generated. Assuming that the output of the BiLSTM is the matrix \(P\) and the transition matrix between labels is \(A\), then for the input sequence \(X = (x_1, x_2, \ldots, x_n)\), the score of the output sequence \(Y = (y_1, y_2, \ldots, y_n)\) is expressed as:
\[ S(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} \]
where \(P_{i, y_i}\) represents the score of label \(y_i\) for the \(i\)-th character, and \(A_{i, j}\) represents the score of label \(i\) being followed by label \(j\). After normalization, the probability distribution of the output sequence \(Y\) is:
\[ P(Y \mid X) = \frac{\exp(S(X, Y))}{\sum_{\tilde{Y} \in Y_X} \exp(S(X, \tilde{Y}))} \]
where \(Y\) represents the true tag sequence and \(Y_X\) represents all possible tag sequences, over which \(\tilde{Y}\) ranges. The training objective of the CRF model is to maximize the log-likelihood function, and the loss function of the model is defined as:
\[ L = \log(P(Y \mid X)) \]
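The sequence score, the normalized log-probability and the search for the globally optimal label sequence can be sketched on a toy example. The emission matrix `P`, the transition matrix `A` and the penalty value are hypothetical numbers chosen so that per-character prediction would produce the illegal pattern O followed by I; the normalization enumerates all sequences, which is only feasible for such small inputs, and decoding uses the Viterbi algorithm, a standard choice that the description above does not name.

```python
import itertools
import math

LABELS = ["B", "I", "O"]
PENALTY = -1e4  # hypothetical large negative score for an illegal transition

# A[i][j]: score of label i being followed by label j (O -> I is illegal).
A = [
    [0.0, 0.0, 0.0],      # from B
    [0.0, 0.0, 0.0],      # from I
    [0.0, PENALTY, 0.0],  # from O
]
# P[t][k]: BiLSTM score of label k for the t-th character (hypothetical).
P = [
    [0.0, 0.0, 2.0],
    [1.0, 1.5, 0.0],
    [0.0, 0.0, 1.0],
]


def score(P, A, y):
    """S(X, Y): emission scores plus transition scores along the sequence."""
    emission = sum(P[t][y[t]] for t in range(len(y)))
    transition = sum(A[y[t]][y[t + 1]] for t in range(len(y) - 1))
    return emission + transition


def log_prob(P, A, y):
    """log P(Y|X), normalizing over every possible label sequence."""
    n, k = len(P), len(P[0])
    scores = [score(P, A, cand) for cand in itertools.product(range(k), repeat=n)]
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return score(P, A, y) - log_z


def viterbi(P, A):
    """Return the label index sequence with the highest score."""
    n, k = len(P), len(P[0])
    best = list(P[0])
    backpointers = []
    for t in range(1, n):
        step, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda i: best[i] + A[i][j])
            pointers.append(prev)
            step.append(best[prev] + A[prev][j] + P[t][j])
        best = step
        backpointers.append(pointers)
    path = [max(range(k), key=lambda j: best[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


greedy = [max(range(len(LABELS)), key=lambda k: row[k]) for row in P]
decoded = viterbi(P, A)
print("greedy :", [LABELS[k] for k in greedy])
print("viterbi:", [LABELS[k] for k in decoded])
```

Here the per-character choice picks I directly after O, while the decoded sequence replaces it with B because the transition score makes the illegal sequence far less probable.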
(4) Knowledge graph storage
The structured data and semi-structured data from step (2), and the unstructured data after knowledge identification in step (3), are stored in the Neo4j graph database through step (4). In this embodiment, the Neo4j graph database is selected to store the knowledge graph of the ocean engineering field. Knowledge graph data is typically stored in a database that serves intelligent applications; because a knowledge graph consists of entity nodes and relationship edges, a graph database can offer an application better performance than a relational database, can process large-scale data sets without sacrificing application performance, and can answer complex queries efficiently. Neo4j queries graph data using the Cypher language. After step (4) is completed, the knowledge graph of the ocean engineering field is established.
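One way to write triples into Neo4j is to generate a Cypher `MERGE` statement per triple, as in the following sketch. The node label `Entity`, the `name` property and the sample triple are hypothetical, since the source does not describe its graph schema, and the sketch only builds the statement and its parameters without connecting to a database.

```python
import re

# Relationship types are inlined into the statement in this sketch, so they
# are restricted to simple identifiers before being used.
RELATION_TYPE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def triple_to_cypher(head, relation, tail):
    """Build a MERGE statement and parameters for one (head, relation, tail)."""
    if not RELATION_TYPE.fullmatch(relation):
        raise ValueError(f"unsupported relation type: {relation!r}")
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        f"MERGE (h)-[:{relation}]->(t)"
    )
    return query, {"head": head, "tail": tail}


query, parameters = triple_to_cypher("Harbor Expansion", "LOCATED_IN", "Jiaozhou Bay")
print(query)
print(parameters)
```

Using `MERGE` rather than `CREATE` keeps the load idempotent, so re-running it for data from several sources does not duplicate entity nodes that share a name.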
(5) Intelligent searching method
Intelligent search answers the search questions of domain users on the basis of the ocean engineering knowledge graph established in step (4). The user enters a search sentence; the question analysis module extracts the entities mentioned in the search sentence, identifies the relation being searched, and checks whether the search falls within the ocean engineering field; a Cypher query statement is then constructed from the identified user input, and the result the user requires is output. For question entity recognition, a BERT-CRF model identifies the entity names in the question; for relation recognition, a dictionary matching method identifies the relation words, where the relation dictionary contains the relation words between entities and the attribute relation words of entities. As its last check, the question analysis module determines whether the search sentence belongs to the ocean engineering field: if so, the Cypher query statement construction module is entered; otherwise, a reminder is output asking the user to search within the field. The Neo4j graph database supports Cypher query statements, which makes data queries convenient and retrieval fast. After the entity names and relation words are identified, a Cypher query statement is constructed and executed in Neo4j, and the answer to the user's search is returned.
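The flow from question analysis to Cypher construction can be sketched as follows. Several parts are stand-ins: a fixed list of known entity names replaces the BERT-CRF recognizer, the relation dictionary entries and the `Entity` label are hypothetical, and a question is treated as in-domain only when both an entity and a relation word are found, which is a simplification of the field check.

```python
# Hypothetical relation dictionary: relation word -> relationship type.
RELATION_DICTIONARY = {
    "located": "LOCATED_IN",
    "location": "LOCATED_IN",
    "use mode": "SEA_AREA_USE_MODE",
}
# Stand-in for the entity names a BERT-CRF model would recognize.
KNOWN_ENTITIES = ["Harbor Expansion"]

REMINDER = "Please search within the ocean engineering field."


def analyze(question):
    """Return the (entity, relation) found in the question, or None for each."""
    lowered = question.lower()
    entity = next((e for e in KNOWN_ENTITIES if e.lower() in lowered), None)
    relation = next(
        (rel for word, rel in RELATION_DICTIONARY.items() if word in lowered),
        None,
    )
    return entity, relation


def build_query(question):
    """Return ((cypher, parameters), None) or (None, reminder)."""
    entity, relation = analyze(question)
    if entity is None or relation is None:
        return None, REMINDER
    cypher = (
        f"MATCH (h:Entity {{name: $name}})-[:{relation}]->(t) "
        "RETURN t.name AS answer"
    )
    return (cypher, {"name": entity}), None


result, reminder = build_query("Where is Harbor Expansion located?")
print(result)
print(build_query("What is the weather today?"))
```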
Claims (7)
1. The knowledge graph construction and intelligent search method for the ocean engineering field is characterized by comprising the following five steps:
(1) Constructing an ontology of the ocean engineering field;
(2) Using the ocean engineering ontology obtained in step (1) as the basis for acquiring ocean engineering data, the acquired ocean engineering data comprising three heterogeneous data types: structured data, semi-structured data and unstructured data;
(3) Knowledge identification is carried out on unstructured data obtained in the step (2), and the method comprises the following steps: entity identification, relationship identification and attribute identification;
in the entity identification process, a BERT-BiLSTM-CRF model is adopted as the entity recognition model for the ocean engineering field, the entity recognition model consisting, from bottom to top, of an input layer, a BERT layer, a BiLSTM layer and a CRF layer; the input layer is the concatenation of the special identifiers [CLS] and [SEP] with the sentence to be recognized; the BERT layer represents each input character as a vector; the BiLSTM layer learns the contextual representation of the vectors and extracts text features, wherein the forward LSTM learns the representation of the preceding context from front to back, the backward LSTM learns the representation of the following context from back to front, and the vector representations in the two directions are concatenated and output to the next layer as the final vector; the CRF layer uses the dependencies among labels to add constraints to the output of the BiLSTM layer and converts the feature vectors into the optimal label sequence, yielding the named entities in the ocean engineering field;
in the relation identification process, the position information of the head entity and the tail entity is added on the basis of an R-BERT model to improve relation identification performance; in the data preprocessing step, the identifiers "$" and "#" are added on both sides of the head entity \(e_1\) and the tail entity \(e_2\), respectively, to mark the position of each entity, so that each input sequence takes the form shown in the following formula:
\[ [\mathrm{CLS}]\; x_1, x_2, \ldots, x_{n-j},\ \$\, e_1\, \$,\ x_{n-j+k}, \ldots, x_{n-i},\ \#\, e_2\, \#,\ x_{n-i+t}, \ldots, x_n\; [\mathrm{SEP}] \]
wherein \(n\) represents the length of the input sequence, \(j\) and \(k\) represent the sequence positions on the two sides of the head entity, and \(i\) and \(t\) represent the sequence positions on the two sides of the tail entity; after encoding by the BERT model, R-BERT concatenates the hidden layer vector of [CLS] with the averaged hidden layer vectors of each entity and feeds the result to the softmax layer for relation classification;
(4) Knowledge graph storage: storing the structured data and the semi-structured data from step (2) and the unstructured data after knowledge extraction in step (3) in a Neo4j graph database to form a knowledge graph of the ocean engineering field;
(5) Intelligent search: answering the search questions of domain users on the basis of the ocean engineering knowledge graph established in step (4); the user enters a search sentence, a question analysis module extracts the entities mentioned in the search sentence, identifies the relation being searched and checks whether the search falls within the ocean engineering field; a Cypher query statement is constructed from the identified user input, and the result the user requires is finally output;
the question analysis module uses a BERT-CRF model to identify the entity names in the question; the relation being searched is identified by recognizing relation words with a dictionary matching method, the relation dictionary comprising the relation words between entities and the attribute relation words of entities; as its last check, the question analysis module determines whether the search sentence belongs to the ocean engineering field, and if so, the Cypher query statement construction module is entered, otherwise a reminder is output asking the user to search within the field; after the entity names and relation words are identified, a Cypher query statement is constructed and executed in Neo4j, and the answer to the user's search is returned.
2. The knowledge graph construction and intelligent search method for the ocean engineering field according to claim 1, wherein the methods for constructing the ocean engineering ontology in step (1) comprise the IDEF5 method, the skeleton method, the TOVE method, the METHONTOLOGY method, the SENSUS method, the KACTUS engineering method and the five-step cycle method.
3. The knowledge graph construction and intelligent search method for the ocean engineering field according to claim 1, wherein the structured data in step (2) is obtained as follows: after the structured data is exported from a relational database, the relational database is converted into knowledge triples with semantic information, and the data structure of the relational database is directly mapped into an RDF graph; and a one-to-one mapping between the header fields in the relational database and the entity attributes in the ocean ontology is established using the DM mapping language, thereby realizing knowledge extraction of the structured data.
4. The knowledge graph construction and intelligent search method for the ocean engineering field according to claim 1, wherein the semi-structured data in the step (2) is obtained by adopting a method based on a wrapper, and the processing method comprises the following specific steps:
the wrapper is implemented in three steps: crawling the data, parsing the data and storing the data; first, an HTTP request is sent to the server hosting a given URL, and the server returns a Response object containing the web page content of the URL, the Python libraries for issuing the HTTP request comprising urllib/urllib2, re and requests; then a web page parser is used to extract the data from the HTML page, the web page parser being re or BeautifulSoup; finally, the data is stored in the required format; when crawling data, the manual method is used to analyze the web page structure of each site by hand and to design specific extraction rules for the different web pages, and the automatic extraction method is then used to extract data in batches from web pages of the same type; for a statically loaded web page, Requests is used to obtain the web page source code from the initial URL, and the tag content of the web page is then parsed with the web page parser to obtain the target data; a dynamically loaded web page is handled with Selenium, which updates the web page data in real time by simulating user operations, the user operations comprising clicking a button and entering text.
5. The knowledge graph construction and intelligent search method for the ocean engineering field according to claim 1, wherein step (3) further comprises a step of preprocessing sea area use demonstration reports whose unstructured data is in PDF format: first, a document parser written in Python reads the report text from the PDF file, and paragraphs related to business data are screened from the text to build a knowledge extraction corpus; then the entities and relations in the corpus are manually labeled with a character-based BIO labeling method, in which B indicates that a character is the beginning of an entity, I indicates that a character is inside an entity, and O indicates that a character is not part of any entity; the corpus labeling is completed with the open-source text annotation tool BRAT.
6. The knowledge graph construction and intelligent search method for the ocean engineering field according to claim 1, wherein the methods for unstructured data entity identification in step (3) comprise: rule- and dictionary-based methods, statistical machine learning methods and deep learning methods.
7. The knowledge graph construction and intelligent search method for the field of ocean engineering according to claim 1, wherein the unstructured data entity identification in the step (3) comprises the following specific steps:
adopting a BERT-BiLSTM-CRF model as the entity recognition model, the entity recognition model consisting, from bottom to top, of an input layer, a BERT layer, a BiLSTM layer and a CRF layer; the input layer is the concatenation of the special identifiers [CLS] and [SEP] with the sentence to be recognized; the BERT layer represents each input character as a vector; the BiLSTM layer learns the contextual representation of the vectors and extracts text features, wherein the forward LSTM learns the representation of the preceding context from front to back, the backward LSTM learns the representation of the following context from back to front, and the vector representations in the two directions are concatenated and output to the next layer as the final vector; the CRF layer uses the dependencies among labels to add constraints to the output of the BiLSTM layer and converts the feature vectors into the optimal label sequence, yielding the named entities in the ocean engineering field;
the long sequences of the named entities in the ocean engineering field are memorized by a long short-term memory network (LSTM), and in the LSTM the hidden states at each time step \(t\) are computed as follows:
\[ F_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \]
\[ I_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \]
\[ O_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \]
\[ \tilde{C}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c) \]
\[ C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \]
\[ H_t = O_t \odot \tanh(C_t) \]
wherein \(I_t\), \(F_t\) and \(O_t\) respectively represent the input gate, the forget gate and the output gate of the LSTM network, \(W\) and \(U\) are weight matrices, \(b_f\), \(b_i\), \(b_o\) and \(b_c\) are biases, \(\sigma\) is the activation function, \(\tilde{C}_t\) represents the candidate memory cell, \(C_t\) represents the memory cell, \(H_t\) represents the hidden state output, \(O_t = 1\) represents output, \(O_t = 0\) represents a reset, and \(\odot\) is the Hadamard product;
then the BiLSTM concatenates the forward sequence representation and the reverse sequence representation of the text as the final vector representation of the text, the semantic information of the context being captured through forward and reverse feature learning of the text, and the hidden state \(H_t\) and output \(O_t\) of the BiLSTM at time \(t\) are expressed as:
\[ H_t = [\overrightarrow{H}_t ; \overleftarrow{H}_t] \]
\[ O_t = H_t W_{hq} + b_q \]
wherein \(\overrightarrow{H}_t\) and \(\overleftarrow{H}_t\) denote the hidden states of the forward and backward LSTM at time \(t\);
and finally, a CRF discriminative model learns the dependencies among the labels and adds constraints to the label sequence to obtain the globally optimal label sequence: assuming that the output of the BiLSTM is the matrix \(P\) and the transition matrix between labels is \(A\), then for the input sequence \(X = (x_1, x_2, \ldots, x_n)\), the score \(S(X, Y)\) of the output sequence \(Y = (y_1, y_2, \ldots, y_n)\) is expressed as:
\[ S(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}} \]
wherein \(P_{i, y_i}\) represents the score of label \(y_i\) for the \(i\)-th character, and \(A_{i, j}\) represents the score of label \(i\) being followed by label \(j\); after normalization, the probability distribution \(P(Y \mid X)\) of the output sequence \(Y\) is:
\[ P(Y \mid X) = \frac{\exp(S(X, Y))}{\sum_{\tilde{Y} \in Y_X} \exp(S(X, \tilde{Y}))} \]
wherein \(Y\) represents the true tag sequence and \(Y_X\) represents all possible tag sequences, over which \(\tilde{Y}\) ranges;
the training objective of the CRF model is to maximize the log-likelihood function, and the loss function \(L\) of the model is defined as:
\[ L = \log(P(Y \mid X)). \]
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311058328.4A CN117094390A (en) | 2023-08-22 | 2023-08-22 | Knowledge graph construction and intelligent search method oriented to ocean engineering field |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117094390A (en) | 2023-11-21 |
Family
ID=88774745
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311058328.4A Pending CN117094390A (en) | 2023-08-22 | 2023-08-22 | Knowledge graph construction and intelligent search method oriented to ocean engineering field |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117094390A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117973383A (en) * | 2024-04-01 | 2024-05-03 | 山东大学 | Word segmentation labeling and entity extraction method and system for robot flow automation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||