CN118428467A

CN118428467A - Knowledge graph intelligent construction method based on deep learning

Info

Publication number: CN118428467A
Application number: CN202410664630.2A
Authority: CN
Inventors: 刘义辉
Original assignee: Beijing Shangboxin Technology Co ltd
Current assignee: Beijing Shangboxin Technology Co ltd
Priority date: 2024-05-27
Filing date: 2024-05-27
Publication date: 2024-08-02

Abstract

The application relates to the technical field of knowledge graphs and provides an intelligent knowledge graph construction method based on deep learning, which comprises the following steps: acquiring text data and image data; determining the common semantic base similarity based on the differences between the projection results of the elements of each cluster in the projection matrix corresponding to that cluster; determining the modal context fusion based on the differences between the projection results of the word vectors and the single-mode data descriptors on the consistency matrices under different modes, together with the dual-mode semantic expandable overlapping degree; determining consistent coding feature vectors based on the modal context fusion; determining an updated dynamic programming adjustment factor based on the result of contrastive learning and the consistent coding feature vectors; determining the data fusion result corresponding to each entity node based on the updated dynamic programming adjustment factors; and obtaining a multi-mode knowledge graph based on the data fusion results corresponding to the entity nodes. By improving the interactivity of data in different modes, the application adaptively determines the attention weight of each mode and improves the semantic credibility of the multi-mode knowledge graph.

Description

Knowledge graph intelligent construction method based on deep learning

Technical Field

The application relates to the technical field of knowledge graphs, and in particular to an intelligent knowledge graph construction method based on deep learning.

Background

Multi-modal data refers to information data in multiple modes, including text data, image data, audio data and the like. Data of different modes can complement one another with information from different dimensions, and fusing multi-modal data can further improve the accuracy of data analysis.

At present, there are several different ways to construct a multi-mode knowledge graph. The graph can be constructed directly by performing named entity recognition, relation extraction and the like on the multi-mode data; alternatively, a simple knowledge graph can first be constructed from single-mode data and then completed by techniques such as entity linking, data fusion and entity embedding matching to obtain a knowledge graph for multiple modes. Knowledge graph construction comprises several stages, such as data collection, data processing, entity recognition, relation extraction and knowledge fusion. Knowledge fusion generally refers to fusing the identified entities, relations and attributes, resolving the ambiguity that arises in entity recognition, and ensuring the consistency of the knowledge graph; the more robust the identified entity information, the fewer the ambiguity problems. Therefore, how to eliminate contradictions and ambiguities between data obtained from different data sources, by performing data fusion and data analysis on the data of multiple data sources, is one of the main problems in constructing a knowledge graph.

Disclosure of Invention

The application provides a knowledge graph intelligent construction method based on deep learning, which solves the problem that contradiction and ambiguity among data obtained by different data sources affect the construction of a knowledge graph oriented to multi-mode data, and adopts the following technical scheme:

The application relates to a knowledge graph intelligent construction method based on deep learning, which comprises the following steps:

Respectively acquiring text data and image data from different data sources;

Determining the common semantic base similarity of each element based on the difference between projection results of each element in each cluster in the clustering results of each modal data in the corresponding projection matrix of each cluster;

Determining the modal context fusion between each word and each target area based on the difference of projection results of the word vector and the unimodal data descriptors on the consistency matrix under different modes and the dual-mode semantic expandable overlapping degree between the word vector and the unimodal data descriptors;

Determining a consistent coding feature vector of each cluster based on the modal context fusibility between the elements in the cluster and the other modal cluster; determining an updated dynamic programming adjustment factor of each mode based on the result of contrast learning among the clusters under each mode and the consistent coding feature vector of the clusters;

Determining a data fusion result corresponding to each entity node in the initial knowledge graph based on updated dynamic programming adjustment factors of all modes by adopting a multi-mode fusion model; and complementing the data fusion results corresponding to all the entity nodes in the initial knowledge graph to obtain the multi-mode knowledge graph.

Preferably, the method for determining the common semantic base similarity of each element based on the difference between the projection results of each element in each cluster in the cluster results of each modal data in the projection matrix corresponding to each cluster comprises the following steps:

Acquiring, by using an ELMo model and a Word2vec model respectively, the word vector of each word in each text data sequence and the single-mode data descriptor of each target area in the clean image data;

A clustering algorithm is adopted to acquire clustering results of the word vector and the target area based on an undirected graph formed by the word vector and the target area of the word;

For any one word vector cluster, taking each word vector in each cluster as a row vector in a matrix, and taking the matrix formed by all word vectors in each cluster as a semantic non-negative matrix of each cluster;

For any cluster of target areas, taking a single-mode data descriptor of each target area in each cluster as a row vector in a matrix, and taking a matrix formed by single-mode data descriptors of all target areas in each cluster as a semantic non-negative matrix of each cluster;

For any cluster, taking a semantic non-negative matrix of each cluster as input, and decomposing the semantic non-negative matrix into a result of multiplying a consistency matrix and a projection matrix by adopting an NMF algorithm;

Taking the pearson correlation coefficient between each element in any cluster and the projection result of each element on the projection matrix in the semantic nonnegative matrix decomposition result of the cluster as the connotation semantic similarity corresponding to each element;

taking the difference between the connotation semantic similarity of each element and the minimum connotation semantic similarity over all elements in the cluster where that element is located as the numerator;

taking, as the denominator, the sum of 0.01 and the accumulated bit variance between the projection result of each element and the projection results of the other elements of its cluster on the projection matrix in the semantic non-negative matrix decomposition result of that cluster;

The ratio of the numerator to the denominator is taken as the common semantic base similarity of each element.
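The ratio described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the application's implementation: the text does not define the "bit variance" lsd(·,·), so a mean squared difference is used as a stand-in, and each element vector is assumed to have the same length as its projection result so that a Pearson correlation can be computed. The function names are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant vector; use 0 here
    return cov / math.sqrt(var_a * var_b)

def mean_sq_diff(a, b):
    """ASSUMPTION: stand-in for the 'bit variance' lsd(a, b), which the text leaves undefined."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def common_semantic_base_similarity(vectors, projections, dist=mean_sq_diff, mu=0.01):
    """Return h for every element of one cluster.

    vectors[i]     -- element vector c_i (word vector or single-mode data descriptor)
    projections[i] -- its projection result ct_i
    """
    p = [pearson(c, ct) for c, ct in zip(vectors, projections)]
    p_min = min(p)
    h = []
    for i, ct_i in enumerate(projections):
        # accumulated distance to the projections of all other elements in the cluster
        spread = sum(dist(ct_i, ct_n) for n, ct_n in enumerate(projections) if n != i)
        h.append((p[i] - p_min) / (spread + mu))
    return h
```

By construction, the element with the lowest connotation semantic similarity in a cluster receives h = 0, and elements whose projections lie close to the rest of the cluster receive larger values.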

Preferably, the method for acquiring, by using an ELMo model and a Word2vec model respectively, the word vector of each word in each text data sequence and the single-mode data descriptor of each target area in the clean image data comprises the following steps:

Sequentially performing word segmentation and stop-word removal on each piece of original text data to obtain a sequence of words as a text data sequence; using all text data sequences as input of the ELMo model, and obtaining the word vector of each word in each text data sequence by using the ELMo model;

Taking the denoising result of each piece of image data as clean image data, and acquiring each target area in each piece of clean image data, together with a preset number of category labels of each target area, by using a CNN (convolutional neural network) recognition model;

Taking the category description data corresponding to the preset number of category labels of each target area as input of the Word2vec model, obtaining the word vector of each category label by using the Word2vec model, and taking the vector formed by the word vectors of the preset number of category labels of each target area, arranged in descending order of label confidence, as the single-mode data descriptor of each target area.

Preferably, the method for obtaining the clustering results of the word vector and the target area by adopting the clustering algorithm based on the word vector of the word and the undirected graph formed by the target area respectively comprises the following steps:

for text data, taking word vectors of words in all text data sequences as one node in a graph, taking cosine similarity between two word vectors as a similarity measurement result between two corresponding nodes, taking the graph formed by the word vectors of all words as input, and acquiring a clustering result of the word vectors by adopting an AP clustering algorithm;

For image data, each target area is taken as one node in the graph, the structural similarity between two target areas is taken as a similarity measurement result between the two corresponding nodes, the graph formed by all the target areas is taken as input, and an AP clustering algorithm is adopted to obtain a clustering result of the target areas.
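As a rough illustration of the graph input for the text branch, the pairwise cosine similarities between word vectors can be collected into a similarity matrix. The sketch below is an assumption about how that input might be prepared; the function names are hypothetical and the clustering step itself is not shown.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_graph(vectors):
    """Dense similarity matrix: entry [i][j] is the edge weight between nodes i and j."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]
```

A matrix of this kind could then be passed to an affinity propagation implementation that accepts precomputed similarities, for example scikit-learn's `AffinityPropagation(affinity="precomputed")`. For the image branch, the same structure would hold with structural similarity between target areas in place of cosine similarity.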

Preferably, the method for determining the modal context fusion between each word and each target area based on the difference of projection results of word vectors and single-mode data descriptors on consistency matrixes under different modes and the dual-mode semantic expandable overlapping degree between the word vectors and the single-mode data descriptors comprises the following steps:

Wherein T_i,j is the dual-mode semantic expandable overlapping degree between the ith word and the jth target area; c_i is the word vector of the ith word and c_j is the single-mode data descriptor of the jth target area; X_i and X_j are the consistency matrices obtained by decomposing the semantic non-negative matrices in which c_i and c_j are located; J(X_i, X_j) is the Jaccard coefficient between the matrices X_i and X_j; Y() is a cosine similarity function and Y(c_i, c_j) is the cosine similarity between c_i and c_j; h_i and h_j are the common semantic base similarities of the ith word and the jth target area; and μ is a parameter adjustment factor;

R_ij is the modal context fusion between the ith word and the jth target area; N is the number of word vectors in the cluster where the word vector of the ith word is located, and n indexes the nth word vector other than the word vector of the ith word; M is the number of single-mode data descriptors in the cluster where the single-mode data descriptor of the jth target area is located, and m indexes the mth single-mode data descriptor other than that of the jth target area; T_n,m is the dual-mode semantic expandable overlapping degree between the word corresponding to the nth word vector and the target area corresponding to the mth single-mode data descriptor; ct_i(j) is the projection result of the word vector of the ith word on the consistency matrix of the semantic non-negative matrix where the single-mode data descriptor of the jth target area is located; ct_j(i) is the projection result of the single-mode data descriptor of the jth target area on the consistency matrix of the semantic non-negative matrix where the word vector of the ith word is located; and dtw(ct_i(j), ct_j(i)) is the DTW distance between ct_i(j) and ct_j(i).
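For reference, the DTW distance between two projection results can be computed with the classic dynamic-programming recurrence. The sketch below assumes the projection results are one-dimensional sequences of numbers and uses the absolute difference as the local cost; both choices are assumptions made for illustration, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Local cost is |a[i] - b[j]|; each step may advance in a, in b, or in both.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # advance in a only
                                     cost[i][j - 1],      # advance in b only
                                     cost[i - 1][j - 1])  # advance in both
    return cost[n][m]
```

Because DTW tolerates sequences of different lengths, it can compare projection results obtained from consistency matrices of different sizes; for instance, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 because the repeated 2 is absorbed by the warping.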

Preferably, the method for determining the consistent coding feature vector of each cluster based on the modal context fusion between each cluster and the elements in the cluster under another mode comprises the following steps:

taking a matrix formed by the fusion of the modal context between elements in two clusters under two modes as a consistent coding matrix between the two clusters;

The average value of all elements in each row and each column of each consistent coding matrix is used as a consistent coding row characteristic value and a consistent coding column characteristic value, the vector formed by all consistent coding row characteristic values and consistent coding column characteristic values in each consistent coding matrix is used as a consistent coding characteristic row vector and a consistent coding characteristic column vector of the consistent coding matrix, and the consistent coding characteristic row vector and the consistent coding characteristic column vector are used as a consistent coding characteristic vector.
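The row and column averaging described above is simple enough to sketch directly. In this minimal example the consistent coding matrix is assumed to be a non-empty list of equal-length rows, where entry [n][m] holds the modal context fusion between the nth element of one cluster and the mth element of the other; the function name is hypothetical.

```python
def consistent_coding_feature_vectors(matrix):
    """Return (row_vector, column_vector) of a consistent coding matrix.

    row_vector[n]    -- mean of row n (one value per element of the first cluster)
    column_vector[m] -- mean of column m (one value per element of the second cluster)
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_vector = [sum(row) / n_cols for row in matrix]
    column_vector = [sum(matrix[i][j] for i in range(n_rows)) / n_rows
                     for j in range(n_cols)]
    return row_vector, column_vector
```

For the matrix `[[1.0, 3.0], [5.0, 7.0]]` this gives the row vector `[2.0, 6.0]` and the column vector `[3.0, 5.0]`.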

Preferably, the method for determining the updated dynamic programming adjustment factor of each mode based on the result of contrast learning among the clusters and the consistent coding feature vector of the clusters under each mode comprises the following steps:

determining a semantic expandable cluster of each cluster in each mode based on the result of element contrast learning in all clusters in each mode;

taking the average value of all elements in any one consistent coding feature vector of the consistent coding matrix between each cluster in each mode and each cluster in the other mode as the numerator;

Taking the sum of the difference value and 0.01 of the distribution variance of all elements in a consistent coding feature vector of a consistent coding matrix between each cluster, any semantic expandable cluster of each cluster and the same cluster of the other mode as a denominator;

Accumulating results of the ratio of the numerator to the denominator on the semantically expandable clusters of each cluster are used as first stable values, and accumulating results of the first stable values on all the clusters under another mode are used as unilateral consistent coding stable coefficients of each cluster;

Calculating a logarithmic function calculation result taking a natural constant as a base, wherein the absolute value of a difference value between the average value of the modal context fusions between all elements in any two clusters in each modal and all data in the other modal is a power; taking the accumulated result of the product between the DTW distance between the expansion distance sequences consisting of the preset number of minimum expansion distances of any two clusters in each mode and the calculation result on each mode as the semantic expansibility of each mode;

taking the sum of semantic expansibility and 0.01 of each mode as a denominator, and taking the ratio of the sum of all unilateral uniform coding stability coefficients of all cluster clusters under each mode to the denominator as an updated dynamic programming adjustment factor of each mode.

Preferably, the method for determining the semantic expandable cluster of each cluster in each mode based on the result of element comparison learning in all clusters in each mode comprises the following steps:

Taking all elements in each cluster in each mode as positive samples, taking all elements in the rest clusters as negative samples, taking all positive samples and negative samples as inputs, and acquiring an optimized distance between each pair of positive and negative samples in a mapping space by utilizing a CPC model;

And taking the sum of the optimized distances corresponding to all elements in each cluster and the rest of each cluster as the expansion distance between each cluster and the rest of each cluster, and taking the preset number of clusters with the minimum expansion distance between each cluster as the semantic expandable clusters of each cluster.
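The selection step can be illustrated as follows. The optimized distance learned by the CPC model is represented here by a caller-supplied function `pair_distance`, which is a hypothetical stand-in; the sketch only shows how expansion distances are accumulated and how the closest clusters are picked.

```python
def expansion_distance(cluster_a, cluster_b, pair_distance):
    """Sum of the pairwise optimized distances between all elements of two clusters."""
    return sum(pair_distance(x, y) for x in cluster_a for y in cluster_b)

def semantically_expandable_clusters(clusters, pair_distance, k):
    """For each cluster name, return the names of the k clusters with the
    smallest expansion distance to it (within the same mode)."""
    result = {}
    for name, members in clusters.items():
        ranked = sorted(
            (expansion_distance(members, other, pair_distance), other_name)
            for other_name, other in clusters.items()
            if other_name != name
        )
        result[name] = [other_name for _, other_name in ranked[:k]]
    return result
```

With toy one-dimensional elements and `pair_distance=lambda x, y: abs(x - y)`, clusters `{"a": [0, 1], "b": [2, 3], "c": [10, 11]}` and `k=1` yield `"b"` as the semantically expandable cluster of both `"a"` and `"c"`.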

Preferably, the method for determining the data fusion result corresponding to each entity node in the initial knowledge graph by adopting the multi-mode fusion model based on the updated dynamic programming adjustment factors of all modes includes:

Constructing a knowledge graph based on data of any one mode as an initial knowledge graph;

taking a natural constant as a base number, taking a calculation result with an updated dynamic programming adjustment factor of each mode as an index as a numerator, taking accumulation of the numerator on all modes as a denominator, and taking the ratio of the numerator to the denominator as the attention weight of each mode;

Taking all words and all target area images in a text data sequence as input, and respectively utilizing a BERT-based text feature extraction network and a VGG image feature extraction network in a multi-mode fusion model to obtain text feature vectors and image feature vectors;

updating the text feature vector and the image feature vector based on the attention weight of each mode by using a self-attention mechanism, and carrying out maximum fusion on the updated text feature vector and the updated image feature vector to obtain a multi-mode fusion update vector;

and taking the output of the multi-mode fusion model as a data fusion result corresponding to each entity node in the initial knowledge graph.
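The attention weight defined above is a softmax over the updated dynamic programming adjustment factors of the modes. The sketch below shows that computation together with a deliberately simplified fusion step: scaling each feature vector by its mode's weight and taking the element-wise maximum is an assumption made for illustration, since the text does not spell out how the self-attention update or the maximum fusion is performed. Function names and dictionary keys are hypothetical.

```python
import math

def modality_attention_weights(factors):
    """Softmax over per-mode adjustment factors: exp(f) / sum of exp(f) over all modes.

    Subtracting the maximum first only improves numerical stability;
    it does not change the resulting weights.
    """
    peak = max(factors.values())
    exps = {mode: math.exp(value - peak) for mode, value in factors.items()}
    total = sum(exps.values())
    return {mode: e / total for mode, e in exps.items()}

def weighted_max_fusion(text_vec, image_vec, weights):
    """ASSUMPTION: scale each vector by its mode weight, then take the element-wise maximum."""
    return [max(weights["text"] * t, weights["image"] * v)
            for t, v in zip(text_vec, image_vec)]
```

With equal factors for both modes, each mode receives a weight of 0.5, so `weighted_max_fusion([2.0, 0.0], [0.0, 4.0], weights)` returns `[1.0, 2.0]`.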

Preferably, the method for obtaining the multi-mode knowledge graph based on the completion of the data fusion results corresponding to all the entity nodes in the initial knowledge graph comprises the following steps:

and carrying out equivalent link completion updating on the data fusion result corresponding to each entity node in the initial knowledge graph and each entity node in the initial knowledge graph by using an equivalent symbol, and updating all entity nodes in the initial knowledge graph to obtain the knowledge graph as a multi-mode knowledge graph.

The beneficial effects of the application are as follows. First, the overlap of semantic ranges between clusters of different modes is analyzed through the clustering results of the target areas of the image data and of the text data sequences, and the consistency of the latent representations of data in different modes is evaluated to construct the modal context fusion. Second, a consistent coding matrix at the cluster level is constructed for the two modes from the modal context fusion between data of different modes, and the consistent coding feature vectors between clusters of different modes are obtained. Third, a contrastive learning model is used to learn the optimized distance between positive and negative sample pairs drawn from different clusters of the same mode; this contrastive learning among the clusters improves the interactivity of data in different modes and maximizes mutual information, and the semantic expandable clusters, formed by the possible semantic expansion results of the elements in each cluster, can be screened based on the optimized distance. Finally, the semantic expansibility of the elements in each cluster under a single mode is evaluated by combining the consistent coding feature vectors and the semantic expandable clusters to obtain the updated dynamic programming adjustment factor, which measures the degree of influence of each mode's data on the whole knowledge graph during completion. The attention weight of each mode in the multi-mode fusion model is thereby determined adaptively, the semantic credibility of the multi-mode data fusion result is improved, and the constructed multi-mode knowledge graph performs better in vehicle purchase recommendation.

Drawings

In order to illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a schematic flow chart of a knowledge graph intelligent construction method based on deep learning according to an embodiment of the present application;

FIG. 2 is a flowchart of an implementation of a knowledge-graph intelligent construction method based on deep learning according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a multi-modal fusion model structure according to an embodiment of the present application.

Detailed Description

The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

Referring to fig. 1, a flowchart of a knowledge graph intelligent construction method based on deep learning according to an embodiment of the application is shown, and the method includes the following steps:

Step S001, acquiring text data and image data from different data sources, respectively.

In the application, taking the construction of the multi-mode knowledge graph in the vehicle purchasing recommendation system as an example, in the process of constructing the multi-mode knowledge graph, firstly, an initial knowledge graph is constructed by utilizing single-mode data, secondly, data fusion is carried out by collecting data of different modes from a plurality of data sources, the accuracy of entity attribute analysis is improved, and the initial knowledge graph is complemented based on the multi-mode data fusion result, so that the semantic information of each entity node is more accurate, the attribute is more comprehensive, and the construction of the multi-mode knowledge graph is completed.

In the application, text data and image data are used as the multi-mode data for analysis, and original text data and image data related to vehicle purchase recommendation are obtained respectively. Specifically, automobile-related text data is obtained from a large number of web pages on automobile knowledge and automobile science popularization; the text data includes, but is not limited to, performance parameters, manufacturer specifications, configuration files, purchase benefits and the like. Next, vehicle-related image data, including but not limited to whole vehicles, vehicle parts and mechanical structures, is obtained from a large number of vehicle-related web pages.

Further, each piece of original text data is input to the jieba word segmentation tool for word segmentation; the word segmentation result of each piece of original text data is then taken as input, stop words are removed using an existing stop-word list, and the sequence formed by the remaining words is taken as a text data sequence. Stop-word removal and word segmentation are known techniques, and the specific process is not repeated. For the acquired image data, in order to eliminate the influence of image noise on subsequent analysis, each piece of image data is denoised with a bilateral filtering denoising algorithm and the denoised result is taken as clean image data; the bilateral filtering denoising algorithm is a known technique, and the specific process is not repeated.
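The text branch of this preprocessing can be sketched as a small function. To keep the example self-contained, the segmenter is passed in as a parameter rather than imported; the function name and the toy stop-word list are hypothetical.

```python
def to_text_sequence(raw_text, segment, stop_words):
    """Segment raw text into tokens and drop stop words and blank tokens.

    segment    -- callable mapping a string to a list of tokens
    stop_words -- set of tokens to discard
    """
    return [token for token in segment(raw_text)
            if token.strip() and token not in stop_words]

# Toy usage with whitespace segmentation standing in for a real segmenter.
sequence = to_text_sequence("the hub is 20 inches", str.split, {"the", "is"})
```

With the jieba package installed, passing `segment=jieba.lcut` would apply jieba's segmentation to Chinese text in place of the whitespace split used here.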

And obtaining a preprocessed text data sequence and clean image data for subsequent analysis of semantic fusibility among different modal data.

Step S002, determining the common semantic base similarity of the elements based on the difference between the projection results of the elements in the clusters in the projection matrix corresponding to the clusters; determining dual-mode semantic expandable overlap based on context semantic information similarity and common semantic base similarity among elements in different modes; and determining the modal context fusion between each word and each target area based on the difference of projection results of elements on the consistency matrix under different modes and the expandable overlapping degree of dual-mode semantics.

The data of each mode contains a large number of entities and a large amount of entity attribute information, and fusing data at the same level yields better analysis results in multi-mode data fusion. Each piece of text data contains relatively clear semantic entities, or keywords, for example a 20-inch wheel hub or a first-ranked 0-100 km/h acceleration time; each piece of image data contains a large number of targets, large background areas and other semantically abstract areas. If whole text data and whole image data are fused directly, the result is of low quality and poor semantic interpretability.

Further, when multi-modal data is integrated into a structural representation of an entity, directly projecting all modes into a common subspace to capture the representation shared by the different modes may lose information specific to each mode, which may result in poor completion of the knowledge graph. Therefore, the application uses contrastive learning among the clusters under different modes to improve the interactivity of data in different modes and to maximize mutual information, so that the constructed multi-mode knowledge graph performs better in vehicle purchase recommendation. The implementation flow of the whole scheme is shown in FIG. 2.

Specifically, in the data of any mode, each word or each target has similar objects, such as an automobile hub and a blade hub, or a white tire image and a golden tire image, and the context information within each mode likewise contains data with a certain degree of similarity. Therefore, the application clusters the data of each mode, which reduces the influence of modal heterogeneity on data fusion efficiency while capturing semantic information with more comprehensive coverage.

Text data contains explicit context information, whereas the context information of image data is relatively vague. Therefore, each piece of clean image data is used as input to a target recognition model, whose output is a recognition result marking all target areas, and the top b category labels by confidence are kept for each target area, where b takes an empirical value of 3. The target recognition model is a CNN (Convolutional Neural Network), the optimization algorithm is Adam (adaptive moment estimation) and the loss function is the cross-entropy function; training a neural network is a known technique, and the specific process is not repeated.

For text data, all text data sequences are used as input of an ELMo (Embeddings from Language Models) model, and the ELMo model is used to obtain the word vector of each word in each text data sequence; the ELMo model is a known technique, and the specific process is not repeated. Further, the word vector of each word in all text data sequences is taken as one node in a graph, the cosine similarity between two word vectors is taken as the similarity measurement between the two corresponding nodes, the graph formed by the word vectors of all words is taken as input, and the AP (Affinity Propagation) clustering algorithm is used to obtain the clustering result of the word vectors; the cluster where the word vector of the ith word is located is denoted B_i. For image data, each target area in each piece of clean image data is taken as one node in a graph, the structural similarity between two target areas is taken as the similarity measurement between the two corresponding nodes, the graph formed by all target areas is taken as input, and the AP clustering algorithm is used to obtain the clustering result of the target areas; the AP algorithm is a known technique, and the specific process is not repeated. AP clustering is chosen because the data of each mode is large in volume, so the number of clusters cannot be preset.

When multi-mode data fusion is performed, the same object should have similar semantics under different modes in an ideal state, and the data characteristics of each mode should be similar as much as possible in a common subspace constructed by the multi-mode data, namely, the potential representation of each mode is approximately consistent. And each object has different limiting conditions or can be combined with different constraint conditions to generate different semantic information under different modes, so that the consistency of potential data representation in different modes is evaluated according to the overlapping condition of semantic ranges among clusters of different modes.

Further, for any cluster, the vector corresponding to each element in the cluster is taken as a row vector of a matrix: if the cluster corresponds to text data, the vector corresponding to each element is a word vector; if the cluster corresponds to image data, the vector corresponding to each element is a single-mode data descriptor. The matrix formed by the vectors of all elements in each cluster is taken as the semantic non-negative matrix of that cluster. Then, with the semantic non-negative matrix of each cluster as input, the NMF (Non-negative Matrix Factorization) algorithm is used to decompose the semantic non-negative matrix into the product of a consistency matrix and a projection matrix, where each column of the projection matrix is the projection result of one row vector of the semantic non-negative matrix on the consistency matrix; the NMF algorithm is a known technique, and the specific process is not repeated.
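As a rough illustration of the decomposition step, the sketch below factorizes a small non-negative matrix with the widely used multiplicative update rules for NMF. The text does not say which NMF solver is used, so this choice, along with the rank, iteration count, random initialization and all function names, is an assumption.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh))

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix V (n x d) into W (n x rank) and H (rank x d)."""
    rng = random.Random(seed)
    n, d = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(d)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[x * a / (b + eps) for x, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(h, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[x * a / (b + eps) for x, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(w, num, den)]
    return w, h
```

One possible reading of the description is to factorize the transposed semantic non-negative matrix, `x, p = nmf(transpose(v), rank)`, so that column i of the second factor `p` corresponds to the ith element of the cluster, in the way the projection matrix is described; this mapping onto the consistency and projection matrices is an interpretation, not something the text states explicitly.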

Based on the above analysis, modality context fusions are constructed here for characterizing the degree of fusibility of context relationships of different entities in different modalities. Calculating the modal context fusion between the ith word and the jth target area:

$$h_i = \frac{P(c_i, ct_i) - P_{i,\min}}{\sum_{c_n \in B_i,\; c_n \neq c_i} lsd(ct_i, ct_n) + \mu}$$

Where h_i is the common semantic base similarity of the ith word; c_i is the word vector of the ith word; ct_i is the projection result of c_i in the projection matrix of the semantic non-negative matrix where c_i is located; P(c_i, ct_i) is the connotation semantic similarity of the ith word, equal to the Pearson correlation coefficient between c_i and ct_i; P_i,min is the minimum connotation semantic similarity over all words in cluster B_i; N is the number of word vectors in cluster B_i, and n indexes the nth word vector in B_i other than c_i; ct_n is the projection result of the nth word vector in the projection matrix; lsd(ct_i, ct_n) is the bit variance between the vectors ct_i and ct_n; μ is a parameter adjustment factor that prevents the denominator from being 0 and takes an empirical value of 0.01. The Pearson correlation coefficient and the bit variance are known techniques, and the specific process is not repeated;

T _i,j is the expandable overlapping degree of dual-mode semantics between the ith word and the jth target area, c _j is a monomodal data descriptor of the jth target area, X _i、X_j is a consistency matrix of a semantic non-negative matrix where c _i、c_j is located respectively, J (X _i,X_j) is a Jaccard coefficient between matrices X _i、X_j, Y () is a cosine similarity function, Y (c _i,c_j) is a cosine similarity between c _i、c_j, h _j is a common semantic base similarity of the jth target area, and the calculation principle of the common semantic base similarity of the target area and the word is consistent and is not repeated; the Jaccard coefficient is a known technology, and the specific process is not repeated;

R _ij is the modal context fusion between the ith word and the jth target area, M is the number of monomodal data descriptors in the cluster where c _j is located, M is the mth monomodal data descriptor except for c _j, T _n,m is the dual-mode semantic expandable overlap between the word corresponding to the nth word vector and the target area corresponding to the mth monomodal data descriptor, ct _i (j) is the projection result of c _i on the consistency matrix of the semantic non-negative matrix where c _j is located, ct _j (i) is the projection result of c _j on the consistency matrix of the semantic non-negative matrix where c _i is located, DTW () is a DTW (Dynamic Time Warping) distance function, DTW (ct _i(j),ct_j (i)) is the DTW distance between ct _i(j)、ct_j (i), the DTW distance is a known technology, and the specific process is not repeated.
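Two of the known quantities named in these definitions, the Pearson correlation coefficient P(c_i, ct_i) and the cosine similarity Y(c_i, c_j), can be sketched as follows. This is only an illustration of those two building blocks, not of the full formulas for h_i, T_i,j or R_ij. The vectors are toy values, and equal vector lengths are assumed so that both measures are defined.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical (word vector, projection result) pairs for one cluster B_i.
cluster = {
    "hub": ([0.2, 0.8, 0.5], [0.25, 0.75, 0.5]),
    "paint": ([0.9, 0.1, 0.4], [0.3, 0.6, 0.2]),
}
p = {word: pearson(c, ct) for word, (c, ct) in cluster.items()}
p_min = min(p.values())
margin = {word: value - p_min for word, value in p.items()}  # P(c_i, ct_i) - P_i,min

# Cosine similarity between a word vector and a hypothetical descriptor c_j.
y_ij = cosine([0.2, 0.8, 0.5], [0.4, 1.6, 1.0])
```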

The better the semantic information of the i-th word represents the semantic information of all word vectors in the cluster where c_i is located, the more likely the i-th word is a semantically stable word, the smaller the change in its vector before and after decomposition, the larger the value of P(c_i, ct_i), and the larger the value of P(c_i, ct_i) − P_i,min. The closer the semantic information of word vector c_i is to that of the other word vectors in cluster B_i, the greater the similarity between the corresponding projection results, the smaller the value of lsd(ct_i, ct_n), and the larger the value of h_i. The higher the probability that c_i and c_j represent semantic information of the same object in different modalities, the smaller the difference between the consistency matrices of the clusters where c_i and c_j are located, the larger the value of J(X_i, X_j), and the larger the value of Y(c_i, c_j). Meanwhile, the more stable the semantic information of c_i and c_j, the smaller the influence of the semantics of the i-th word and the j-th target area on adjacent data, the closer their semantic stability, the smaller the value of |h_i − h_j|, and the larger the value of T_i,j. The higher the probability that c_i and c_j are characterizations of the same object in different modalities, the higher their consistency in the multi-modal decomposition subspace, the more similar their projection results on the different consistency matrices, and the smaller the value of DTW(ct_i(j), ct_j(i)). That is, the larger the value of R_ij, the more similar the context information of the i-th word and the j-th target area, and the greater their fusibility.
The beneficial effect of the modality context fusion is that the influence of heterogeneous factors of the multi-modality data can be reduced by extracting the local context information of the data in different modalities, and the influence of useless components such as noise and the like on the real semantics is reduced.

In this way, the modal context fusion between each word and each target area is obtained, for later use in determining the fusion result of the data in each modality.

Step S003, determining the updated dynamic programming adjustment factor of each modality based on the optimized distance between positive and negative sample pairs in the contrastive learning mapping space of the modal data, and on the modal context fusion between data of different modalities.

Further, the initial knowledge graph is constructed from a single modality as follows. Taking text data as an example, all text data sequences are taken as input, and a named entity recognition technique and a rule matching technique are applied in turn to recognize the entities and entity attributes in each text data sequence; named entity recognition and rule matching are known techniques and their specific processes are not repeated. Next, triples are constructed from the entities and attributes in the text data sequences and the relations between them, and the initial knowledge graph of the vehicle purchase recommendation system is constructed from all the triples; the construction of a knowledge graph is a known technique and its specific process is not repeated.
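As a rough illustration of the triple-based construction described above, the sketch below stores (entity, relation, attribute) triples and indexes them by entity. The `Triple` type, the `build_graph` helper, the relation names and the sample entities are all hypothetical; the recognition and rule-matching steps themselves are not shown.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    head: str       # entity
    relation: str   # relation between the entity and the attribute
    tail: str       # attribute value or related entity

def build_graph(triples):
    """Index triples by head entity: {head: [(relation, tail), ...]}."""
    graph = {}
    for t in triples:
        graph.setdefault(t.head, []).append((t.relation, t.tail))
    return graph

# Hypothetical output of the recognition and rule-matching steps.
extracted = [
    Triple("Car A", "has_color", "white"),
    Triple("Car A", "has_hub_size", "20 inch"),
    Triple("Car B", "has_color", "pink"),
]
graph = build_graph(extracted)
```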

In the process of acquiring multi-modal data from multiple data sources, the data in each modality is often updated dynamically. For example, as more automobile paint colors become available, the number of automobile images grows in real time; the text descriptions of automobiles also grow, and correspondingly the context information of the words in each text data sequence takes on new semantic information. When multi-modal data is updated in real time, an update can generally be regarded as a change to the original data: for example, a white automobile image is updated to a pink automobile image, or text data describing a 20-inch hub is updated to text data stating that a 19-inch hub gives a more attractive profile. Within each cluster in each modality, entity groups composed of different numbers of entities have a certain information expansibility among their entities, and the expansion result often has a high similarity to the data update result.

Specifically, for any two clusters in different modalities, take the A-th cluster of text data, l_A, and the B-th cluster of image data, l_B, as an example. The modal context fusion is calculated for every pair consisting of one element of l_A and one element of l_B, and the matrix formed by these values is taken as the consistent coding matrix between l_A and l_B, where the element in row g and column o of the consistent coding matrix is the modal context fusion between the word corresponding to the g-th word vector in l_A and the target area corresponding to the o-th single-mode data descriptor in l_B. Next, the mean of all elements in each row and in each column of the consistent coding matrix is taken as a consistent coding row feature value and a consistent coding column feature value, respectively. The vector formed by all consistent coding row feature values and the vector formed by all consistent coding column feature values are taken as the consistent coding feature row vector and the consistent coding feature column vector of the consistent coding matrix, respectively, and both are referred to as consistent coding feature vectors.
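The row and column feature vectors described above reduce to per-row and per-column means of the consistent coding matrix. A minimal sketch follows; the function name and the toy fusion values are assumptions made for illustration.

```python
def consistent_coding_features(matrix):
    """Return (row_feature_vector, column_feature_vector): the per-row and
    per-column means of a consistent coding matrix given as a list of rows."""
    row_features = [sum(row) / len(row) for row in matrix]
    col_features = [sum(col) / len(col) for col in zip(*matrix)]
    return row_features, col_features

# Toy fusion values: 2 words in l_A (rows) by 3 target areas in l_B (columns).
r = [[0.25, 0.5, 0.75],
     [1.0, 0.5, 0.0]]
row_vec, col_vec = consistent_coding_features(r)
```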

Further, all word vectors in cluster l_A are used as positive samples and all word vectors in the remaining text data clusters are used as negative samples. All positive and negative samples are taken as input to a CPC (Contrastive Predictive Coding) model, with Adam as the optimization algorithm and InfoNCE (Information Noise Contrastive Estimation) as the loss function, and the optimized distance between each pair of positive and negative samples in the mapping space is output; the training of neural networks is a known technique and the detailed process is not repeated. For each text data cluster other than l_A, the sum of the optimized distances corresponding to its word vectors is computed and taken as the expansion distance between cluster l_A and that text data cluster. The M_2 text data clusters with the smallest expansion distance to cluster l_A are taken as the semantic expandable clusters of cluster l_A, where M_2 takes the empirical value 5. It should be noted that, for clusters of target areas, the target areas in the cluster are taken as samples and the above steps are repeated to obtain the semantic expandable clusters of each target-area cluster. The purpose of this is to assess the semantic expansibility of information between individual data while ensuring that intra-class variation is minimized for each modality's data.
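The selection of semantic expandable clusters can be sketched as below, assuming the optimized distances from the contrastive model are already available. The function name, the cluster identifiers and the distance values are hypothetical, and the contrastive training itself is not shown.

```python
def semantic_expandable_clusters(optimized_distances, m2=5):
    """optimized_distances maps each other cluster's id to the optimized
    distances between its samples and the samples of the anchor cluster.
    Returns (ids of the m2 clusters with the smallest summed distance,
    expansion distance per cluster)."""
    expansion = {cid: sum(ds) for cid, ds in optimized_distances.items()}
    ranked = sorted(expansion, key=expansion.get)
    return ranked[:m2], expansion

# Toy distances from anchor cluster l_A to three other text clusters.
distances = {"B1": [0.2, 0.3], "B2": [1.0, 1.5], "B3": [0.1, 0.1]}
selected, expansion = semantic_expandable_clusters(distances, m2=2)
```

With the empirical value given in the text, `m2` would be 5; a smaller value is used here only because the toy input has three clusters.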

Based on the analysis, an updated dynamic programming adjustment factor is constructed herein to characterize the impact of each modality data update on the multimodal data fusion results. Calculating an updated dynamic programming adjustment factor for the a-th modality:

Where r_a,A is a single-sided consistent coding stability coefficient of the A-th cluster in the a-th modality, N_2 is the number of clusters corresponding to image data, B is the B-th cluster of target areas, M_2 is the number of semantic expandable clusters of the A-th cluster, W is the W-th semantic expandable cluster of the A-th cluster, C_AB is either consistent coding feature vector of the consistent coding matrix between the A-th cluster and cluster l_B, σ(C_AB) and the mean of C_AB are respectively the distribution variance and the mean value of the elements in C_AB, σ(C_WB) is the distribution variance of all elements in the consistent coding feature row vector of the consistent coding matrix between the W-th semantic expandable cluster and cluster l_B, and μ is a parameter adjustment factor that prevents the denominator from being 0 and takes the empirical value 0.01;

u_a is the semantic scalability of the a-th modality, N_3 is the number of clusters in the a-th modality, α is the α-th cluster in the a-th modality, the two averaged terms are respectively the mean modal context fusion between all elements in the A-th cluster and all data in the other modality and the mean modal context fusion between all elements in the α-th cluster and all data in the other modality, ln() is the natural logarithm, D_A and D_α are the extended distance sequences consisting of the M_2 smallest expansion distances of the A-th and α-th clusters respectively, DTW() is the DTW (Dynamic Time Warping) distance function, and DTW(D_A, D_α) is the DTW distance between D_A and D_α; the DTW distance is a known technique and its specific process is not repeated;

V_a is the updated dynamic programming adjustment factor of the a-th modality, and Δr_a,A is the sum of the two single-sided consistent coding stability coefficients of the A-th cluster in the a-th modality.
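The DTW distance used in these definitions can be sketched with the classic dynamic-programming recurrence. The sketch assumes one-dimensional numeric sequences and absolute difference as the local cost; the application does not state which local cost is used, and the sequences below are toy values.

```python
def dtw_distance(a, b):
    """DTW distance between two numeric sequences, using absolute
    difference as the local cost and the standard three-way recurrence."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # advance in a only
                                    cost[i][j - 1],      # advance in b only
                                    cost[i - 1][j - 1])  # advance in both
    return cost[n][m]

# Toy extended distance sequences of two clusters.
d_a = [0.2, 0.5, 0.9]
d_alpha = [0.2, 0.5, 0.5, 0.9]
same_shape = dtw_distance(d_a, d_alpha)   # repeated value is absorbed by warping
shifted = dtw_distance([0.0, 0.0], [1.0, 1.0])
```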

The higher the fusibility between the elements of the A-th cluster in the a-th modality and the elements of the data clusters in the other modality, the larger the values of the elements in each consistent coding feature vector of the corresponding consistent coding matrix, and the larger the mean value of the elements in C_AB. The smaller the optimized distance, in the feature space of the contrastive model, between the positive samples contained in the A-th cluster and the negative samples contained in its semantic expandable clusters, the greater their similarity in data representation, the smaller the difference between σ(C_AB) and σ(C_WB), the larger the first stability value, and the larger the value of r_a,A. The fewer the possible semantic interpretations of the data in the a-th modality, the weaker its expansibility, the smaller the semantic coverage of the existing data in the a-th modality, and the smaller the semantic change among different data; accordingly, the closer the fusibility of elements in different clusters with the data of the other modality, and the smaller the corresponding term in the formula. The greater the likelihood that different clusters have the same semantic expandable clusters, the smaller the differences between their extended distance sequences, the smaller the value of DTW(D_A, D_α), and the smaller the value of u_a. The larger the value of V_a, the weaker the semantic expansibility of the data in the a-th modality, and the greater the influence of the completion result of the data in the a-th modality on the knowledge graph.

Thus, the updated dynamic programming adjustment factor of each modality is obtained, for later use in determining the attention weight of each modality in the attention mechanism when the deep learning model performs multi-modal data fusion.

Step S004, adaptively determining the attention weight of each modality based on the updated dynamic programming adjustment factors of all modalities, and completing the construction of the multi-modal knowledge graph with a multi-modal fusion model based on the attention weights.

Further, updated dynamic programming adjustment factors under all modes are calculated, a sequence formed by the updated dynamic programming adjustment factors under all modes is used as an input sequence of a self-attention mechanism in the fusion model, and attention weight of each mode data is determined based on the self-attention mechanism. Calculating the attention weight of the a-th modality:

Where w_a is the attention weight of the a-th modality, V_a is the updated dynamic programming adjustment factor of the a-th modality, exp() is the exponential function with the natural constant as base, and K is the number of modality types.
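Given these definitions (an exponential function and a count K of modalities), one natural reading is that the attention weights are a softmax over the adjustment factors, w_a = exp(V_a) / (exp(V_1) + ... + exp(V_K)). This form is an assumption, since the formula itself is not present in the text. A minimal sketch under that assumption, with made-up factor values:

```python
import math

def attention_weights(factors):
    """Softmax over the per-modality adjustment factors V_1..V_K
    (assumed form of the attention weight formula)."""
    peak = max(factors)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in factors]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical adjustment factors for the text and image modalities (K = 2).
w = attention_weights([0.8, 0.3])
```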

Further, all words in a text data sequence and all target area images are taken as input, and a BERT (Bidirectional Encoder Representations from Transformers)-based text feature extraction network and a VGG (Visual Geometry Group) image feature extraction network in the multi-modal fusion model are used to obtain text feature vectors and image feature vectors, respectively. The BERT network uses Adam as its optimization algorithm and the cross-entropy loss as its loss function; the VGG network uses Adam as its optimization algorithm and the MSE function as its loss function. Using the self-attention mechanism, the text feature vector and the image feature vector are each updated based on the attention weight of the corresponding modality, and the updated text feature vector and the updated image feature vector are combined by maximum fusion to obtain a multi-modal fusion update vector. The output obtained by passing the multi-modal fusion update vector through a classifier is taken as the data fusion result corresponding to each entity node in the initial knowledge graph, where the classifier is a softmax classifier. The structure of the multi-modal fusion model is shown in fig. 3. The training of neural networks is a known technique and its specific process is not repeated.
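The weighting and maximum-fusion step can be sketched as follows. The sketch assumes that updating a feature vector by its attention weight means scaling it, and that maximum fusion means an element-wise maximum over equal-length vectors; neither detail is spelled out in the text. Feature extraction and the classifier are not shown, and the values are toy numbers.

```python
def max_fusion(text_vec, image_vec, w_text, w_image):
    """Scale each modality's feature vector by its attention weight, then
    take the element-wise maximum (one reading of 'maximum fusion')."""
    return [max(w_text * t, w_image * v) for t, v in zip(text_vec, image_vec)]

# Toy feature vectors and attention weights for the two modalities.
fused = max_fusion([1.0, 0.0, 0.5], [0.0, 1.0, 0.5], w_text=0.5, w_image=0.5)
```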

Further, an equivalence symbol is used to link the data fusion result corresponding to each entity node to that entity node in the initial knowledge graph, which completes the update of the node; the knowledge graph obtained after all entity nodes in the initial knowledge graph have been updated is taken as the multi-modal knowledge graph.

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. The foregoing description of the preferred embodiments of the present application is not intended to be limiting, but rather, any modifications, equivalents, improvements, etc. that fall within the principles of the present application are intended to be included within the scope of the present application.

Claims

1. The knowledge graph intelligent construction method based on deep learning is characterized by comprising the following steps of:

Respectively acquiring text data and image data from different data sources;

2. The knowledge graph intelligent construction method based on deep learning according to claim 1, wherein the method for determining the common semantic base similarity of each element based on the difference between projection results of each element in each cluster in the projection matrix corresponding to each cluster in the clustering results of each modal data is as follows:

3. The knowledge graph intelligent construction method based on deep learning according to claim 2, wherein the method for obtaining the Word vector of the Word in each text data sequence and the clean image data and the single-mode data descriptor of the target area by using ELMo model and Word2vec model respectively comprises the following steps:

4. The knowledge graph intelligent construction method based on deep learning according to claim 2, wherein the method for obtaining the clustering results of the word vector and the target area based on the word vector of the word and the undirected graph formed by the target area by adopting the clustering algorithm comprises the following steps:

5. The knowledge graph intelligent construction method based on deep learning according to claim 1, wherein the method for determining the modality context fusion between each word and each target area based on the difference of projection results of word vectors and single-modality data descriptors on consistency matrixes under different modalities and the dual-mode semantic expandable overlapping degree between the word vectors and the single-modality data descriptors is as follows:

6. The knowledge-graph intelligent construction method based on deep learning according to claim 1, wherein the method for determining the consistent coding feature vector of each cluster based on the modal context fusion between each cluster and the elements in the cluster under another modality is as follows:

7. The knowledge graph intelligent construction method based on deep learning according to claim 1, wherein the method for determining the updated dynamic programming adjustment factor of each mode based on the result of contrast learning among clusters and the consistent coding feature vector of the clusters in each mode comprises the following steps:

8. The knowledge graph intelligent construction method based on deep learning according to claim 1, wherein the method for determining the semantic expandable cluster of each cluster in each mode based on the result of element contrast learning in all clusters in each mode is as follows:

9. The intelligent knowledge graph construction method based on deep learning according to claim 1, wherein the method for determining the data fusion result corresponding to each entity node in the initial knowledge graph based on the updated dynamic programming adjustment factors of all modes by adopting the multi-mode fusion model is as follows:

10. The intelligent knowledge graph construction method based on deep learning according to claim 1, wherein the method for obtaining the multi-modal knowledge graph based on the completion of the data fusion results corresponding to all the entity nodes in the initial knowledge graph comprises the following steps: