CN111221916A

CN111221916A - Entity relationship diagram (ERD) generating method and device

Info

Publication number: CN111221916A
Application number: CN201910951352.8A
Authority: CN
Inventors: 王新义
Original assignee: Shanghai Yixun Information Technology Co Ltd
Current assignee: Shanghai Yixun Information Technology Co Ltd
Priority date: 2019-10-08
Filing date: 2019-10-08
Publication date: 2020-06-02

Abstract

The invention discloses a method for generating an entity relationship diagram (ERD) graph. The method performs word segmentation on the physical names of the entities and the physical names of the attributes in a database, merges synonyms among the prototype words (base forms) of the segmented words to generate normalized words, generates vectors corresponding to the entities and the attributes according to the word frequency attributes of the words in the normalized word set, determines the similarity between the entities and the attributes according to the vectors, and finally generates an ERD graph corresponding to the database according to the similarity. With this technical scheme, an accurate ERD graph can be generated rapidly without depending on foreign keys, which improves the adaptability and efficiency of reverse engineering.

Description

Entity relationship diagram (ERD) generating method and device

Technical Field

The invention relates to the technical field of communication, and in particular to a method for generating an entity relationship diagram (ERD) graph. The invention also relates to an ERD graph generating device.

Background

The ERD graph (Entity Relationship Diagram) provides a method for representing entity types, attributes and relationships, and is used to describe a conceptual model of the real world. It is a data model or schema graph used for high-level description of conceptual data models, and it provides graphical symbols for representing such models. The basic elements constituting the ERD graph are as follows:

Entity: things that can be objectively distinguished from each other are entities. An entity can be a concrete person or thing, or an abstract concept or connection. The key point is that one entity can be distinguished from another, and entities with the same attributes have the same characteristics and properties. An entity name together with its set of attribute names is used to abstract and characterize entities of the same kind. In the ERD graph, entities are represented by rectangles, and the entity names are written inside the rectangles.

Attribute: a property of an entity. An entity can be characterized by several attributes. Attributes cannot be separated from an entity, and attributes are relative to an entity.

Relationship: the reflection, in the information world, of an association within an entity or between entities. An intra-entity association generally refers to an association between the attributes that make up an entity; an association between entities generally refers to an association between different sets of entities. In the ERD graph, relationship names are shown in diamonds, which are connected to the related entities by undirected edges, and the type of the relationship (1:1, 1:n or m:n) is marked on the edges.

Based on the three basic elements of entities, attributes and relationships, ERD graphs are used in the database field primarily to support requirements gathering, database design, database debugging and data integration. At present, in a database design process, a CDM (Conceptual Data Model) or PDM (Physical Data Model) entity corresponds to a DBMS (Database Management System) table, and a CDM or PDM relationship (one-to-one, one-to-many, or many-to-many) is mapped to a DBMS foreign key, so that relationships between entities are expressed through foreign keys. If the DBMS tables have foreign key definitions and foreign key constraints, the ERD graph can be generated by reverse engineering from the DDL (Data Definition Language) statements. Most database tools generate ERD graphs by reverse engineering from foreign keys.

The inventor has found that existing reverse generation of ERD graphs from a DBMS database depends on foreign keys, and entity relationships cannot be generated without them. In some current environments (internet environments), foreign key constraints are discouraged in DBMS design for reasons of cost and equipment performance, so many DBMS tables have no foreign keys. In addition, foreign keys are not recommended in current ODS (Operational Data Store) data warehouses either, so the ERD graph cannot be fully recovered by reverse engineering.

Therefore, how to generate the ERD graph based on the database without depending on the foreign key becomes a technical problem to be solved urgently by those skilled in the art.

Disclosure of Invention

Aiming at the problem that the prior art can generate an ERD graph only by depending on foreign keys, the invention provides a method for generating an entity relationship diagram (ERD) graph, which comprises the following steps:

carrying out word segmentation on the physical names of the entities and the physical names of the attributes in the database;

carrying out synonym combination on prototype words corresponding to the words after word segmentation processing to generate normalized words;

generating vectors corresponding to the entities and the attributes according to the word frequency attribute value and the inverse text frequency attribute value of each word in the normalized word set;

determining similarity between each entity and each attribute according to the vector;

and generating an ERD graph corresponding to the database according to the similarity.

Preferably, the generating of the vector corresponding to each entity and each attribute by the word frequency attribute and the inverse text frequency attribute of each word in the normalized word set further includes:

dividing the normalized word set into an entity set corresponding to the entity and an attribute set corresponding to the attribute;

determining a weight value of each word in the entity set and the attribute set, wherein the weight value is generated according to the word frequency attribute value and the inverse text frequency attribute value;

generating a vector corresponding to each entity according to the number of words corresponding to that entity in the entity set and the weight values, and generating a vector corresponding to each attribute according to the number of words corresponding to that attribute in the attribute set and the weight values.

Preferably, generating a vector corresponding to each of the entities according to the number of words corresponding to each of the entities in the entity set and the weight value, further includes:

and respectively mapping the corresponding words of the entities in the entity set into sub-vectors through a word vector neural network model, and generating the vectors according to the sub-vectors, the weight values and the quantity.

Preferably, generating a vector corresponding to each attribute according to the number of words corresponding to each attribute in the attribute set and the weight value, further includes:

and respectively mapping the words corresponding to the attributes in the attribute set into sub-vectors through a word vector neural network model, and generating the vectors according to the sub-vectors, the weight values and the quantity.

Preferably, determining a similarity between each entity and each attribute according to the vector further includes:

obtaining a cosine value of an included angle between a vector corresponding to each entity and a vector corresponding to each attribute;

and generating a similarity matrix of each entity and each attribute according to the cosine value of the included angle.

Correspondingly, the invention also provides an entity relationship diagram (ERD) graph generation device, which comprises:

the processing module is used for carrying out word segmentation processing on the physical names of the entities and the physical names of the attributes in the database;

the first generation module is used for carrying out synonym combination on the prototype word corresponding to the word after the word segmentation processing so as to generate a normalized word;

a second generation module, configured to generate vectors corresponding to the entities and the attributes according to the word frequency attribute value and the inverse text frequency attribute value of each word in the normalized word set;

a determining module, configured to determine, according to the vector, a similarity between each entity and each attribute;

and the third generation module is used for generating an ERD graph corresponding to the database according to the similarity.

Preferably, the second generating module is specifically configured to:

Preferably, the second generating module is further specifically configured to:

Preferably, the determining module is specifically configured to:

Therefore, by applying the technical scheme of the application, the method performs word segmentation on the physical names of the entities and the physical names of the attributes in the database, merges synonyms among the prototype words corresponding to the segmented words to generate normalized words, generates vectors corresponding to the entities and the attributes according to the word frequency attributes of the words in the normalized word set, determines the similarity between the entities and the attributes according to the vectors, and finally generates the ERD graph corresponding to the database according to the similarity. With this technical scheme, an accurate ERD graph can be generated rapidly without depending on foreign keys, which improves the adaptability and efficiency of reverse engineering.

Drawings

Fig. 1 is a schematic flowchart of an ERD graph generation method disclosed in an embodiment of the present application;

Fig. 2 is a similarity matrix chart in an ERD graph generation method disclosed in an embodiment of the present application;

Fig. 3 is a schematic diagram of the word segmentation process in an ERD graph generation method disclosed in an embodiment of the present application;

Fig. 4 is a schematic diagram of a word segmentation result in an ERD graph generation method disclosed in an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an ERD graph generation device according to an embodiment of the present invention.

Detailed Description

As described in the background, the prior art generally generates the ERD graph from foreign keys, but foreign keys have various disadvantages. The database must maintain the foreign keys internally. A foreign key amounts to enforcing data consistency within transactions (meaning that the database must be in a consistent state before and after a transaction executes), and all of this work is handed to the database server. When insert, delete or update operations are performed on foreign key fields, the related checks must be triggered, which consumes server resources (CPU, memory and the like). Foreign keys also easily cause deadlocks, because they require locks on other tables. Therefore, foreign keys are not used in some environments or industries. In the internet industry in particular, the number of users is large and concurrency is high, so the database server easily becomes a performance bottleneck: it is limited by IO capacity and cannot easily be scaled horizontally.

Therefore, for the case in which the ERD graph cannot be generated from foreign keys, the application discloses an ERD graph generation method that rapidly generates an accurate ERD graph without depending on foreign keys, thereby improving the adaptability and efficiency of reverse engineering. As shown in Fig. 1, the method comprises the following steps:

Step 101, performing word segmentation on the physical names of the entities and the physical names of the attributes in the database.

The data in the database is used to generate an ERD graph by representing the relationships between entities and attributes. Entities are things that can be objectively distinguished from each other; they can be concrete people and things, or abstract concepts and connections. The key point is that one entity can be distinguished from another, and entities with the same attributes have the same characteristics and properties. Attributes are characteristics of entities; they cannot be separated from the entities and are relative to them.

In order to enable a computer to recognize and understand the meaning of the physical names of the entities and the attributes, the physical names of the entities and the attributes need to be subjected to word segmentation, which means that the physical names of the entities and the attributes are converted into expressions of words.

In a specific application scenario, the word segmentation may be dictionary-based: a long word or long sentence is divided into words according to a dictionary, and the optimal combination of the words is then found. Such methods include the maximum matching word segmentation algorithm, the shortest path word segmentation algorithm, word segmentation based on an N-Gram model, and the like. Alternatively, the word segmentation may be character-based: the sentence is split into single characters, the characters are combined into words, and an optimal segmentation strategy is sought, which can be converted into a sequence labeling problem. Such methods include generative-model word segmentation algorithms, discriminative-model word segmentation algorithms, neural network word segmentation algorithms, and the like. Any word segmentation method that can cut long words or long sentences into words and find the optimal combination of the words falls within the protection scope of the application.
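As a rough illustration of dictionary-based segmentation, the sketch below splits a physical name on underscores and camelCase boundaries and, for names without separators, applies forward maximum matching. The function names `split_identifier` and `max_match`, the sample names, and the tiny dictionary are all assumptions made for this example; the text does not prescribe a particular implementation.

```python
import re


def split_identifier(name: str) -> list[str]:
    """Split a physical name on underscores, punctuation and camelCase boundaries."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Runs of capitals, capitalized/lowercase words, or digit runs.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def max_match(text: str, dictionary: set[str]) -> list[str]:
    """Forward maximum matching: at each position take the longest dictionary word.

    Characters that start no dictionary word are emitted one at a time.
    """
    text = text.lower()
    longest = max((len(w) for w in dictionary), default=1)
    i, out = 0, []
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                out.append(piece)
                i += size
                break
    return out


if __name__ == "__main__":
    print(split_identifier("customerOrder_id"))
    print(max_match("custorderid", {"cust", "order", "id"}))
```

Maximum matching is greedy, so it can pick a long word that blocks a better overall split; the shortest path and N-Gram approaches named above exist to address that.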

Step 102, carrying out synonym combination on the prototype word corresponding to the word after the word segmentation processing to generate a normalized word.

In a preferred embodiment of the present application, in order to share the word segmentation corpus, the segmented words need to be normalized: inflected forms (plurals, past-tense forms, etc.) are unified into word prototypes (base forms), and synonyms among the word prototypes are merged based on a dictionary, a synonym dictionary, other methods, or any combination of these. In a specific application scenario, a synonym dictionary may be entered into the system, and the normalized word prototypes are merged according to the synonym groups that the dictionary defines.
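The normalization step can be pictured as two dictionary lookups in sequence. In the minimal sketch below, the `LEMMAS` and `SYNONYMS` tables and the `normalize` function are hypothetical stand-ins for the base-form dictionary and the synonym dictionary mentioned above; a real system would use far larger resources.

```python
# Hypothetical lookup tables, for illustration only.
LEMMAS = {"orders": "order", "customers": "customer", "paid": "pay", "shipped": "ship"}
SYNONYMS = {"client": "customer", "cust": "customer", "purchase": "order"}


def normalize(words: list[str]) -> list[str]:
    """Reduce each word to its base form, then map synonyms to one canonical word."""
    out = []
    for word in words:
        word = word.lower()
        word = LEMMAS.get(word, word)      # plural / past tense -> prototype
        word = SYNONYMS.get(word, word)    # synonym -> canonical prototype
        out.append(word)
    return out


if __name__ == "__main__":
    print(normalize(["Customers", "paid", "client"]))
```

Applying the base-form lookup before the synonym lookup means the synonym dictionary only needs entries for prototypes, not for every inflected form.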

Step 103, generating vectors corresponding to the entities and the attributes according to the word frequency attribute value and the inverse text frequency attribute value of each word in the normalized word set.

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used weighting technique in information retrieval and data mining. TF is the term frequency, which refers to the frequency with which a given word appears in a document. This number is the term count normalized by the document length, to prevent a bias toward long documents. IDF is the inverse document frequency, a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word and taking the base-10 logarithm of the quotient. TF-IDF is mainly used to evaluate the importance of a word to one document in a document set or corpus.
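A small worked example may make the weighting concrete. The numbers are invented for illustration: suppose a word occurs once in a name that contains three words, and two of the ten names in the collection contain that word. Then:

```latex
\[
\mathrm{tf} = \frac{1}{3} \approx 0.333, \qquad
\mathrm{idf} = \log_{10}\frac{10}{2} = \log_{10} 5 \approx 0.699, \qquad
\mathrm{tf} \times \mathrm{idf} \approx 0.233
\]
```

A word that occurred in all ten names would have an IDF of \(\log_{10} 1 = 0\), so it would carry no weight.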

In a preferred embodiment of the present application, in order to obtain the importance of the words of the physical names of the entities and the attributes within their corresponding sets, the normalized word set is divided into an entity set corresponding to the entities and an attribute set corresponding to the attributes, and a weight value is calculated for each word in the entity set and in the attribute set. The weight value is generated according to the word frequency attribute value and the inverse text frequency attribute value.

Further, in the preferred embodiment of the present application, in order to better calculate the relationship between the entities and the attributes so as to generate the ERD graph, the physical names of the entities and the physical names of the attributes are converted into vectors. The vector corresponding to each entity is generated according to the number of words and the weight values of that entity in the entity set. Specifically, Word2vector (word2vec, a group of related models for generating word vectors) is used to map the words corresponding to each entity in the entity set into sub-vectors, and the entity vector is then generated according to the sub-vectors, the weight values and the number of words. These models are shallow, two-layer neural networks. The Word2vector model maps each word to a vector that can represent word-to-word relationships. Word vectors have good semantic properties: different syntactic and semantic features of a word are distributed across the dimensions of its vector.

It should be noted that although the above embodiments have been described with specific models and mapping manners, the types of the relevant models and mapping manners are not limited thereto, and any changes that can be made by those skilled in the art should fall within the scope of the present application.

Step 104, determining the similarity between each entity and each attribute according to the vector.

In the preferred embodiment of the present application, in order to obtain the relationship between the entity and the attribute more conveniently, specifically, the cosine similarity of the vectors corresponding to the entity and the attribute is calculated to obtain the relationship between the entity and the attribute, but the obtaining of the similarity between the vectors in the present application is not limited to obtaining by calculating the cosine similarity, and any alternative conceivable by those skilled in the art should be included in the protection scope of the present application.

The cosine similarity measures the similarity between two vectors by measuring the cosine value of the included angle of the two vectors, and the result is irrelevant to the length of the vectors and only relevant to the pointing direction of the vectors, and the similarity condition of two texts can be known by calculating the cosine values of the two vectors.

Specifically, an included angle cosine value between a vector corresponding to each entity and a vector corresponding to each attribute is obtained, the included angle cosine value is used to evaluate the similarity between the vector corresponding to the entity and the vector corresponding to the attribute, and then a similarity matrix between the entity and the attribute is generated according to the included angle cosine value, as shown in fig. 2.

Step 105, generating an ERD graph corresponding to the database according to the similarity.

In the preferred embodiment of the present application, the similarity between the entity and the attribute is obtained through the above operations, and finally, the ERD graph showing the entity relationship model, which is generated without depending on the foreign key, can be obtained according to the similarity.

By applying the technical scheme of the application, the physical names of the entities and the physical names of the attributes in the database are subjected to word segmentation; synonyms among the prototype words corresponding to the segmented words are merged to generate normalized words; vectors corresponding to the entities and the attributes are generated according to the word frequency attributes of the words in the normalized word set; the similarity between the entities and the attributes is determined according to the vectors; and finally, an ERD graph corresponding to the database is generated according to the similarity. The ERD graph generation method provided by the application does not depend on foreign keys, and it improves the adaptability and efficiency of reverse engineering.

In order to further illustrate the technical idea of the present invention, the technical solution of the present invention will now be described with reference to specific application scenarios.

Step 201, performing word segmentation processing on the physical names of the entities and the physical names of the attributes in the database.

In a specific application scenario, a DDL statement of the DBMS is taken as the input of the tokenizer/parser that performs the word segmentation processing. The word segmentation process is shown in Fig. 3, and the result after word segmentation is shown in Fig. 4.
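To illustrate what taking a DDL statement as input could look like, the sketch below pulls table names (entity physical names) and column names (attribute physical names) out of simple `CREATE TABLE` statements with a regular expression. The sample DDL, the `extract_names` function and the keyword list are assumptions for this example. It is not a SQL parser: types containing commas, such as `DECIMAL(10,2)`, would break the naive comma split.

```python
import re

# Invented sample input; real DDL would come from the DBMS.
DDL = """
CREATE TABLE customer_order (
    order_id INT PRIMARY KEY,
    customer_id INT,
    order_date DATE
);
"""

# Leading keywords that mark a table-level constraint rather than a column.
CONSTRAINT_KEYWORDS = {"PRIMARY", "FOREIGN", "CONSTRAINT", "UNIQUE", "KEY", "INDEX", "CHECK"}


def extract_names(ddl: str) -> dict[str, list[str]]:
    """Map each table name to the list of its column names."""
    tables = {}
    pattern = r"CREATE\s+TABLE\s+(\w+)\s*\((.*?)\)\s*;"
    for match in re.finditer(pattern, ddl, re.S | re.I):
        table, body = match.group(1), match.group(2)
        columns = []
        for definition in body.split(","):
            tokens = definition.split()
            if tokens and tokens[0].upper() not in CONSTRAINT_KEYWORDS:
                columns.append(tokens[0])
        tables[table] = columns
    return tables


if __name__ == "__main__":
    print(extract_names(DDL))
```

The extracted names would then be passed to the word segmentation step.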

Step 202, carrying out synonym combination on the word prototype corresponding to the obtained word after the word segmentation processing so as to generate a normalized word.

In a specific application scenario, the segmented words (plurals, past-tense forms, etc.) are normalized into word prototypes and merged based on a dictionary. A synonym dictionary can be entered into the system, and synonyms among the normalized word prototypes are merged according to the synonym groups that the dictionary defines.

Step 203, generating vectors corresponding to each entity and each attribute according to the word frequency attribute value and the inverse text frequency attribute value of each word in the normalized word set.

In a specific application scenario, the normalized word set is divided into an entity set corresponding to an entity and an attribute set corresponding to an attribute.

TF-IDF (Term Frequency-Inverse Document Frequency) is a commonly used weighting technique in information retrieval and data mining. TF is the term frequency, that is, the frequency with which a given word occurs in a document. The raw count is normalized by the document length to prevent a bias toward long documents (the same word may have a higher count in a long document than in a short one, regardless of its importance). For a word t_i in a particular document d_j, the term frequency may be expressed as:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$$

In the above formula, the numerator n_{i,j} is the number of times the word appears in document d_j, and the denominator is the sum of the numbers of occurrences of all words in document d_j.

IDF is the inverse document frequency, a measure of the general importance of a term. The IDF of a specific term is obtained by dividing the total number of documents by the number of documents containing the term and taking the base-10 logarithm of the quotient. The formula is as follows:

$$idf_{i} = \log_{10} \frac{|D|}{|\{ j : t_i \in d_j \}|}$$

where |D| is the total number of documents and the denominator is the number of documents containing the word t_i.

The product of TF and IDF is then calculated:

$$tfidf_{i,j} = tf_{i,j} \times idf_{i}$$

TF-IDF tends to filter out common words, leaving important words.
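The weighting can be sketched in a few lines. The code below treats each physical name as one "document", which is an assumption of this example rather than something the text states, and uses the base-10 logarithm described above. The function name `tf_idf` and the sample names are invented.

```python
import math
from collections import Counter


def tf_idf(documents: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Return, per document, the TF-IDF weight of each word it contains."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for words in documents.values():
        doc_freq.update(set(words))

    weights = {}
    for name, words in documents.items():
        counts = Counter(words)
        total = len(words)
        weights[name] = {
            word: (count / total) * math.log10(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights


if __name__ == "__main__":
    names = {
        "customer_order": ["customer", "order"],
        "order_item": ["order", "item"],
        "customer": ["customer"],
        "item_price": ["item", "price"],
    }
    for name, row in tf_idf(names).items():
        print(name, row)
```

With this definition a word that appears in every name receives a weight of zero, which is the filtering effect described above.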

Furthermore, vectors corresponding to the entities and the attributes are generated according to the number of words corresponding to each entity in the entity set and to each attribute in the attribute set, together with the weight values.

In a specific application scenario, word2vector is used to vectorize the entities and the attributes respectively, so that each entity and each attribute is represented by a numeric vector. word2vector is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words; under the bag-of-words assumption in word2vector, the order of words is unimportant. The model maps each word to a vector that can represent word-to-word relationships. Because word vectors have good semantic properties, and the value in each dimension of a word vector can represent a feature with a certain semantic and grammatical interpretation, word vectors are a common way to represent word features. The calculation formula of the entity vector is as follows:

wherein n is the number of words corresponding to the entity, TFIDF_i is the TF-IDF value of the i-th word of the entity name, and WORDVEC_i is the word vector of the i-th word of the entity name.

The calculation formula of the attribute vector is as follows:

wherein n is the number of words corresponding to the attribute name, TFIDF_i is the TF-IDF value of the i-th word of the attribute name, and WORDVEC_i is the word vector of the i-th word of the attribute name.
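The entity and attribute vector formulas themselves are not reproduced in this text. One plausible reading of the symbol definitions (n, TFIDF_i, WORDVEC_i) is a TF-IDF-weighted average of the word vectors, and the sketch below implements that reading as an assumption, not as the patent's confirmed formula. The function `name_vector` and the two-dimensional toy word vectors are invented; in practice the vectors would come from a trained word-vector model.

```python
def name_vector(
    words: list[str],
    tfidf: dict[str, float],
    word_vectors: dict[str, list[float]],
) -> list[float]:
    """Assumed formula: (1/n) * sum_i TFIDF_i * WORDVEC_i over the n words of a name."""
    dim = len(next(iter(word_vectors.values())))
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        for k, component in enumerate(word_vectors[word]):
            total[k] += tfidf[word] * component
    return [component / len(words) for component in total]


if __name__ == "__main__":
    # Toy values for illustration only.
    vectors = {"customer": [1.0, 0.0], "order": [0.0, 2.0]}
    weights = {"customer": 0.5, "order": 0.25}
    print(name_vector(["customer", "order"], weights, vectors))
```

The same function serves for both entity names and attribute names, since the two formulas are described with identical symbols.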

Step 204, determining the similarity between the vectors according to the cosine values of the included angles of the vectors.

In a specific application scenario, the similarity between two vectors is determined by calculating the cosine of the angle between them. The cosine of a 0-degree angle is 1, the cosine of any other angle is less than 1, and the minimum value is -1. The cosine of the angle between two vectors thus indicates whether the two vectors point in approximately the same direction. When the two vectors have the same direction, the cosine similarity is 1; when the angle between them is 90 degrees, the cosine similarity is 0; when they point in completely opposite directions, the cosine similarity is -1. The result is independent of the lengths of the vectors and depends only on their directions. Cosine similarity is generally used in the positive space, so the values obtained lie between 0 and 1; this value is the similarity between each entity and each attribute. For an entity vector A and an attribute vector B, the similarity calculation formula is as follows:

$$similarity = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

Then, a similarity matrix may be obtained from the results of the similarity calculation, as shown for example in Fig. 2.
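The following sketch computes cosine similarity and arranges the results as an entity-by-attribute matrix, here a nested dictionary. The function names and the sample vectors are assumptions for illustration, and returning 0.0 for a zero-length vector is a design choice of this example, since the cosine is undefined in that case.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(
    entity_vectors: dict[str, list[float]],
    attribute_vectors: dict[str, list[float]],
) -> dict[str, dict[str, float]]:
    """Rows are entities, columns are attributes, cells are cosine similarities."""
    return {
        entity: {attr: cosine(ev, av) for attr, av in attribute_vectors.items()}
        for entity, ev in entity_vectors.items()
    }


if __name__ == "__main__":
    entities = {"customer": [1.0, 0.0], "order": [0.0, 1.0]}
    attributes = {"customer_id": [2.0, 0.0], "order_date": [0.0, 3.0]}
    for entity, row in similarity_matrix(entities, attributes).items():
        print(entity, row)
```

Because the cosine ignores vector length, scaling either vector leaves the similarity unchanged.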

Step 205, generating an ERD graph corresponding to the database according to the similarity matrix.
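The text does not specify how the similarity matrix is turned into relationships. One simple possibility, shown here purely as an assumption, is to keep every entity-attribute pair whose similarity reaches a threshold and treat it as a candidate edge of the ERD graph. The function `link_by_threshold`, the threshold value 0.8 and the sample matrix are all hypothetical.

```python
def link_by_threshold(
    matrix: dict[str, dict[str, float]],
    threshold: float = 0.8,  # hypothetical cut-off, to be tuned per database
) -> list[tuple[str, str, float]]:
    """Return (entity, attribute, score) candidate edges, strongest first."""
    edges = []
    for entity, row in matrix.items():
        for attribute, score in row.items():
            if score >= threshold:
                edges.append((entity, attribute, score))
    return sorted(edges, key=lambda edge: -edge[2])


if __name__ == "__main__":
    # Invented similarity values for illustration.
    matrix = {
        "customer": {"order.customer_id": 0.92, "item.price": 0.10},
        "order": {"order.customer_id": 0.85, "item.price": 0.20},
    }
    for edge in link_by_threshold(matrix):
        print(edge)
```

Under this rule, an attribute of one table that scores highly against a different entity (for instance a `customer_id` column in an order table against the customer entity) would suggest a relationship between the two tables even though no foreign key is declared.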

In order to achieve the above object, the present invention further provides an entity relationship diagram (ERD) graph generating device, including:

Preferably, the second generating module is specifically configured to:

Preferably, the second generating module is further specifically configured to:

Preferably, the determining module is specifically configured to:

Through the above description of the embodiments, those skilled in the art will clearly understand that the present invention may be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the method according to the implementation scenarios of the present invention.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The above sequence numbers are for description only and do not represent the relative merits of the implementation scenarios.

The above disclosure describes only a few specific implementation scenarios of the present invention. The present invention is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present invention.

Claims

1. An entity relationship diagram (ERD) graph generation method is characterized by comprising the following steps:

2. The method of claim 1, wherein generating vectors corresponding to each of the entities and each of the attributes according to the word frequency attribute and the inverse text frequency attribute of each word in the normalized word set comprises:

3. The method of claim 2, wherein generating a vector corresponding to each of the entities according to the number of words and the weight value corresponding to each of the entities in the entity set comprises:

4. The method according to claim 2, wherein generating a vector corresponding to each of the attributes according to the number of words corresponding to each of the attributes in the attribute set and the weight value comprises:

5. The method according to any of claims 1-4, wherein the similarity between each of said entities and each of said attributes is determined from said vector, in particular:

6. An entity relationship diagram (ERD) graph generation device is characterized by comprising:

7. The device of claim 6, wherein the second generation module is specifically configured to:

8. The device of claim 7, wherein the second generation module is further specifically configured to:

9. The device of claim 7, wherein the second generation module is further specifically configured to:

10. The device according to any one of claims 6 to 9, wherein the determining module is specifically configured to: