CN111274289A

CN111274289A - Similarity calculation method based on symbol sequence

Info

Publication number: CN111274289A
Application number: CN202010052343.8A
Authority: CN
Inventors: 唐峤
Original assignee: Beijing Han Ming Qing Information Technology Co Ltd
Current assignee: Beijing Han Ming Qing Information Technology Co Ltd
Priority date: 2020-01-17
Filing date: 2020-01-17
Publication date: 2020-06-12

Abstract

The invention provides a similarity calculation method based on a symbol sequence. The method measures the similarity between schema elements in a relational database using rules that a dictionary-based compression scheme generates from the contents of sequences in the database. It can be applied to different attributes of one table, to different tables in one database, or to different tables in different databases. The method is a new automatic schema matching method and can be used in application domains such as schema integration, data warehousing, e-commerce, semantic query processing, the semantic web and the like. The invention monitors the database automatically, reflects the patterns expressed in the data instances more accurately, and saves development and maintenance time and cost.

Description

Similarity calculation method based on symbol sequence

Technical Field

The invention relates to the field of computer algorithms, in particular to a similarity calculation method based on a symbol sequence.

Background

One basic operation in manipulating database schema information is matching, which takes two schemas as input and produces a mapping between those elements of the two schemas that correspond semantically to each other. Matching plays a central role in many applications, such as web-oriented data integration, e-commerce, schema integration, schema evolution and migration, application evolution, data warehousing, database design, website creation and management, and component-based development.

In the field of database application technology, schema matching plays a very important role. Several common database application domains rely on schema matching techniques, including schema integration, data warehousing, e-commerce, semantic query processing, the semantic web, and the like. A customizable generic implementation of schema matching can make it easier to build applications that include automatic schema matching. Such an implementation may also be a key component of a database management model in which the mappings returned by the matching operation are used as input for merging schemas or composing mappings.

At present, schema matching is typically performed manually with the support of a graphical user interface. This approach has limitations: manually specifying schema matches is tedious, time-consuming, error-prone and expensive. The problem grows with the rapidly increasing number of data sources and e-commerce systems to be integrated. Furthermore, as systems handle more complex databases and applications, their schemas become larger, which increases the number of matches to be performed. There is therefore a need for a faster, less labor-intensive, automated method. In view of these shortcomings, the applicant has designed a similarity calculation method based on a symbol sequence.

Disclosure of Invention

To overcome the shortcomings of the prior art, the invention aims to provide a similarity calculation method based on a symbol sequence, in which important patterns are first identified in the instances of a database attribute, rules are then generated from the patterns in the instances, and finally, once the rules are generated, similarity is evaluated by comparing the rules.

The invention provides a similarity calculation method based on a symbol sequence; the method comprises the following steps:

1) identifying important patterns in a database sequence, and representing the original input sequence according to a certain rule;

2) generating pattern rules through an iterative process that identifies all repeated patterns in the original input sequence by reading one symbol at a time, from the first symbol to the last symbol of the input sequence;

3) executing a special post-processing step after pattern rules have been generated for all data in a database sequence;

4) comparing the rules generated from the repeated patterns to produce a similarity score, and evaluating the similarity.

Preferably, the step 1 is specifically divided into the following three steps:

1) identifying an original input sequence in a database;

2) representing an original input sequence in a system according to a certain representation rule;

3) generating a grammar rule for the input sequence.

Preferably, the step 2 is specifically divided into the following eleven steps:

1) reading an identifier in a database sequence;

2) judging whether all grammar rules satisfy digram uniqueness in the symbol sequence of the identifier;

3) if all grammar rules satisfy digram uniqueness, judging whether the usage of every grammar rule satisfies rule reusability;

4) if the grammar rules do not all satisfy digram uniqueness, creating a new grammar rule, and then judging whether the usage of every grammar rule, including the new one, satisfies rule reusability;

5) if the usage of every grammar rule satisfies rule reusability, judging whether all grammar rules satisfy digram uniqueness and rule reusability at the same time;

6) if the usage of some grammar rule does not satisfy rule reusability, expanding that grammar rule, and then judging whether all grammar rules satisfy digram uniqueness and rule reusability at the same time;

7) if the grammar rules do not all satisfy digram uniqueness and rule reusability at the same time, returning to step 2;

8) if all grammar rules satisfy digram uniqueness and rule reusability at the same time, judging whether all identifiers in the database sequence have been read;

9) if all identifiers in the database sequence have been read, the operation is finished;

10) stopping the iteration process after the work is finished;

11) if not all identifiers in the database sequence have been read, returning to step 1, reading the next identifier in the database sequence, and continuing the iterative process.

Preferably, the step 3 is specifically divided into the following three steps:

1) separating rule patterns that span records; all records in the input sequence are joined together as a single continuous sequence of symbols; if a grammar rule is generated across records, that is, the rule spans a record boundary, the right-hand side of the rule is split into two rule patterns, using "\n" and "\r" as the separators;

2) splitting rule patterns into words; if any rule pattern still contains a space after step 1, the space character is used as a separator to split it into several shorter rule patterns;

3) deleting single-terminal patterns; if a rule pattern has only one terminal symbol left after steps 1 and 2, the rule pattern is deleted.

Preferably, the step 4 is specifically divided into the following three steps:

1) comparing the pattern rules generated for two sets of input symbol sequences to find their differences;

2) defining a similarity formula:

where p(i) is the expanded pattern of rule(i) in one of the two sets of input sequences, n(i) is the number of times rule(i) is referenced, and l(i) is the length of the pattern p(i); p(j) is the expanded pattern of rule(j) in the other set, and l(j) is the length of the pattern p(j); p(i, j) is a pattern that appears in both sets, min(p(i, j)) is the smaller of the frequencies with which that pattern appears in the two sets p(i) and p(j), and l(i, j) is the length of the pattern that appears in both sets;

3) calculating a similarity value s that indicates the similarity between the two sets of symbol sequences.
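The formula itself is not reproduced in this text. As an assumption only, one form that is consistent with the symbol definitions above and with a value between 0 and 1 is a length-weighted overlap ratio. It is not confirmed as the formula of the invention. The reference count n(j) of rule(j) in the second set is introduced here by analogy with n(i).

```latex
% Assumed reconstruction, not taken from the source text.
% The sum in the numerator runs over patterns p(i,j) found in both sets.
s = \frac{2 \sum_{(i,j)} \min\bigl(p(i,j)\bigr)\, l(i,j)}
         {\sum_{i} n(i)\, l(i) + \sum_{j} n(j)\, l(j)}
```

Under this assumed form, s = 1 when both sets yield exactly the same patterns with the same reference counts, and s = 0 when they share no pattern.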

Compared with the prior art, the invention has the beneficial effects that:

(1) The method monitors the database automatically and finds similar columns without extensive manual work by database experts, which saves development and maintenance time and cost.

(2) Compared with methods that compare database schemas, the method makes use of the data instances, so it reflects the patterns expressed in the data more accurately and avoids the inaccurate matches that arise when the schema itself carries little information.

(3) The method uses patterns identified in the data rather than matching identical strings, which gives it fuzzy matching characteristics: the data records need not be arranged in the same order, nor must the number of data records be identical. Data similarity is represented by a proximity value calculated by comparing the rules generated from two database attributes.

Drawings

FIG. 1 is a flowchart of the similarity calculation method based on symbol sequences according to the present invention;

FIG. 2 is a flowchart of step 1 of the similarity calculation method;

FIG. 3 is a flowchart of step 2 of the similarity calculation method;

FIG. 4 is a flowchart of step 3 of the similarity calculation method;

FIG. 5 is a flowchart of step 4 of the similarity calculation method.

Detailed Description

To further understand the structure, characteristics and other objects of the present invention, the following detailed description is given with reference to the accompanying preferred embodiments, which are only used to illustrate the technical solutions of the present invention and are not to limit the present invention.

First, FIG. 1 is a flowchart of the similarity calculation method based on symbol sequences according to the present invention. A software system implementing the method performs the following steps: first, identifying important patterns in a database sequence and representing the original input sequence according to a certain rule; second, generating pattern rules through an iterative process that reads one symbol at a time, from the first symbol to the last symbol of the input sequence, to identify all repeated patterns in the original input sequence; third, executing a special post-processing step after pattern rules have been generated for all data in a database sequence; and finally, comparing the rules generated from the repeated patterns to produce a similarity score and evaluating the similarity.

Further, FIG. 2 is a flowchart of step 1 of the similarity calculation method provided by the present invention. The important patterns to be identified in a database sequence fall mainly into two kinds: repetitive sequences and nested repetitive sequences. Table 1 shows a repetitive sequence and Table 2 shows a nested repetitive sequence. In the grammar rules generated for both, "S" represents the start of the original input pattern, and the first rule, which begins with "S", is always equivalent to the original input pattern. In each rule, the left side of the arrow symbol "->" is a non-terminal, which can be expanded into the right side of the arrow; the right side is composed of non-terminals and terminals. Terminals cannot be expanded any further because they are the input symbols of the original sequence. Capital letters in square brackets "[ ]" may appear on both sides of the arrow in each rule and denote the non-terminals used in the rules. The same bracketed capital letter represents the same object wherever it appears in the expanded rules. The brackets distinguish non-terminals from capital letters in the original input sequence and so avoid confusion.

TABLE 1: A repetitive sequence

TABLE 2: A nested repetitive sequence
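The contents of Tables 1 and 2 are not reproduced in this text. The following minimal sketch illustrates the notation described above with two hypothetical grammars, one for a repetitive sequence and one for a nested repetitive sequence. The input strings, the rule names `[A]` and `[B]`, and the helper `expand` are assumptions chosen for illustration and are not the tables of the patent.

```python
def expand(grammar, symbol="S"):
    """Expand a symbol into the terminal string it stands for.

    `grammar` maps each non-terminal (such as "S" or "[A]") to the list of
    symbols on the right-hand side of its rule. Any symbol that is not a key
    of `grammar` is treated as a terminal and emitted unchanged.
    """
    parts = []
    for sym in grammar[symbol]:
        parts.append(expand(grammar, sym) if sym in grammar else sym)
    return "".join(parts)


# Hypothetical repetitive sequence "abcdbc": the pattern "bc" occurs twice.
#   S   -> a [A] d [A]
#   [A] -> b c
repetitive = {
    "S": ["a", "[A]", "d", "[A]"],
    "[A]": ["b", "c"],
}

# Hypothetical nested repetitive sequence "abcdbcabcdbc": the repeated
# pattern "abcdbc" itself contains the repeated pattern "bc".
#   S   -> [A] [A]
#   [A] -> a [B] d [B]
#   [B] -> b c
nested = {
    "S": ["[A]", "[A]"],
    "[A]": ["a", "[B]", "d", "[B]"],
    "[B]": ["b", "c"],
}

print(expand(repetitive))
print(expand(nested))
```

Expanding the rule that begins with "S" reproduces the original input, which is the sense in which the first rule is equivalent to the original input pattern.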

In addition, FIG. 3 is a flowchart of step 2 of the similarity calculation method according to the present invention. The basic idea of step 2 is that any pattern that occurs more than once can be replaced by a production rule that generates the pattern, and that this process can continue recursively. A feature of this step is that no specific data location or variable to be monitored needs to be specified: monitoring is performed globally rather than at one particular location.

Step 2 carries out the entire pattern construction process by reading one symbol at a time, from the first symbol to the last symbol of the input sequence. The process is bottom-up: new rules are built on the original input symbols or on previously created rules. Throughout, the grammar rules must preserve two properties: digram uniqueness and rule reusability (rule utility).

First, digram uniqueness. Assume that rule S is the top-level rule reflecting the entire sequence. When a new symbol (a terminal) is observed, it is first appended to rule S, and the newly appended symbol and the symbol before it then form a new digram. If this digram already appears elsewhere in the grammar, the first constraint is violated. In this case a new rule must be created, with the digram on its right-hand side and a new non-terminal on its left-hand side, and both occurrences of the digram are replaced by references to this new non-terminal. However, a duplicated digram does not always produce a new rule. If the digram already forms the right-hand side of an existing rule, no new rule needs to be created, since the digram is simply replaced by the non-terminal of the existing rule.

Second, rule reusability. Initially, the right-hand side of every rule in the generated grammar is only two symbols long, whether the symbols are terminals or non-terminals. Longer rules arise as follows: when a new symbol is appended to the top-level rule, the symbol before it may be a non-terminal, and the two together form a digram. If that digram is repeated, it first creates a new rule in the grammar. However, if a rule ends up being referenced only once in the entire grammar, it violates the rule utility constraint: the rule is deleted, and the digram on its right-hand side is substituted into the rule that referenced it.
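The effect of the two properties can be sketched in code. The block below is a deliberately simplified offline variant, not the incremental symbol-by-symbol procedure of step 2: it repeatedly replaces the most frequent repeated pair of adjacent symbols in the top-level sequence with a new rule (in the style of pair-replacement compressors), and then removes every rule that is referenced only once. All function names and the `[1]`, `[2]`, ... rule names are assumptions made for illustration.

```python
from collections import Counter


def count_digrams(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts, last_pos = Counter(), {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last_pos.get(pair) == i - 1:
            continue  # overlaps the previous occurrence, e.g. inside "aaa"
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def replace_pair(seq, pair, name):
    """Replace occurrences of `pair` in `seq`, left to right, by `name`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(name)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def inline(rhs, name, body):
    """Substitute the body of rule `name` wherever it is referenced in `rhs`."""
    out = []
    for sym in rhs:
        out.extend(body if sym == name else [sym])
    return out


def build_grammar(text):
    seq, rules = list(text), {}
    # Uniqueness: while some pair occurs twice, turn it into a rule.
    while True:
        counts = count_digrams(seq)
        if not counts:
            break
        pair, n = counts.most_common(1)[0]
        if n < 2:
            break
        name = f"[{len(rules) + 1}]"
        rules[name] = list(pair)
        seq = replace_pair(seq, pair, name)
    # Utility: a rule referenced only once is folded into its single user.
    changed = True
    while changed:
        changed = False
        uses = Counter(s for rhs in [seq, *rules.values()] for s in rhs if s in rules)
        for name in list(rules):
            if uses[name] == 1:
                body = rules.pop(name)
                seq = inline(seq, name, body)
                rules = {k: inline(v, name, body) for k, v in rules.items()}
                changed = True
                break
    return {"S": seq, **rules}


def expand(grammar, symbol="S"):
    """Expand a symbol back into the terminal string it represents."""
    return "".join(expand(grammar, s) if s in grammar else s for s in grammar[symbol])


grammar = build_grammar("abcdbcabcdbc")
for left, right in grammar.items():
    print(left, "->", " ".join(right))
```

For this input the sketch keeps only two rules besides S, because the intermediate two-symbol rules that end up referenced once are folded into the rule that uses them; this is how rules longer than two symbols arise.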

In addition, FIG. 4 is a flowchart of step 3 of the similarity calculation method provided by the present invention. When the pattern comparison method is applied to all data in a database column, a special post-processing step must be performed after all pattern rules have been generated and before the comparison is made. The post-processing is divided into the following three steps:

1) Separating rule patterns that span records. Because all records in a column are fed to the system joined together as a single continuous sequence of symbols, rules may be generated that span record boundaries. Since the records are separated by the carriage return (\r) and new line (\n) symbols, if any rule spans records, the pattern on the right-hand side of that rule is split into two patterns, using "\r" and "\n" as the separators.

2) Splitting rule patterns into words. If any pattern still contains spaces after step 1, it is split into several shorter patterns using the space character as the delimiter.

3) Deleting single-terminal patterns. If a pattern has only one terminal symbol left after the first two steps, it is removed from the list.
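The three post-processing steps above can be sketched as follows. This is a minimal illustration that assumes each rule pattern has already been expanded into a plain string of terminals; the function name `post_process` and the sample patterns are hypothetical.

```python
import re


def post_process(patterns):
    """Apply the three post-processing steps to expanded rule patterns."""
    result = []
    for pattern in patterns:
        # Step 1: cut a pattern that spans records at the record
        # separators (carriage return and new line).
        for piece in re.split(r"[\r\n]+", pattern):
            # Step 2: cut the remaining pieces at space characters.
            for word in piece.split(" "):
                # Step 3: drop pieces that are empty or consist of a
                # single terminal symbol.
                if len(word) > 1:
                    result.append(word)
    return result


print(post_process(["ab\r\ncd ef", "x", "john smith\njo"]))
```

How the reference counts of the original rules carry over to the split patterns is not specified in the text, so this sketch returns the patterns only.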

In addition, FIG. 5 is a flowchart of step 4 of the similarity calculation method provided by the present invention. Once the rules for two sets of input symbol sequences have been generated, they must be compared to produce a value indicating the similarity between the two sets. The calculated value serves only as an indication of similarity. It is displayed as a percentage from 0% to 100%: 0% means that no rule generated from one symbol set is identical to any rule generated from the other; 100% means that all rules generated from the two symbol sets are identical, that no rule appears in only one of the sets, and that each rule is reused the same number of times in both. The similarity formula is defined as follows:

The similarity value s ranges between 0 and 1. It serves only as an approximate evaluation because it is calculated from the generated rules alone, without considering the terminals that remain in rule (0), which represents the entire input sequence. Thus, even when the rules and their reference counts are the same, the two input sequences still differ if the remaining terminals in rule (0) are in different positions. The similarity value s can therefore be used only as an indication.
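Because the formula is not reproduced in this text, the following sketch rests on an assumption: it scores similarity as twice the length-weighted count of shared patterns divided by the total length-weighted count of both rule sets. This choice satisfies the stated boundary cases (0 when no rule is shared, 1 when the rules and their reuse counts are identical), but it is not confirmed as the formula of the invention. The function name `similarity` and the sample data are hypothetical.

```python
def similarity(rules_a, rules_b):
    """Return an assumed similarity value between 0 and 1.

    Each argument maps an expanded rule pattern (a string) to the number
    of times that rule is referenced in its grammar.
    """
    shared = rules_a.keys() & rules_b.keys()
    # For each pattern found in both sets: the smaller of the two
    # reference counts, weighted by the pattern length.
    matched = sum(min(rules_a[p], rules_b[p]) * len(p) for p in shared)
    # Reference counts weighted by pattern length, over both sets.
    total = sum(n * len(p) for p, n in rules_a.items())
    total += sum(n * len(p) for p, n in rules_b.items())
    return 2 * matched / total if total else 0.0


column_a = {"bc": 4, "abcd": 2}
column_b = {"bc": 2, "xy": 3}
print(f"{similarity(column_a, column_b):.0%}")
```

The score depends only on the rule patterns and their counts, so, as noted above, two inputs whose leftover terminals in rule (0) differ can still receive the same value.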

Finally, the similarity calculation method based on the symbol sequence has the following specific technical characteristics:

1) the basic idea is to first identify important patterns in the instances of database attributes and then generate rules from those patterns; once the rules are generated, similarity is evaluated by comparing the rules;

2) the basic method compares symbol sequences for similarity by applying a symbol-based, compression-like algorithm;

3) the hierarchical, non-repeating, expandable symbol tree of the input symbol sequence is generated in a single scan, with no need for multiple scans, which makes the method convenient and efficient;

4) the generative pattern matching method relies on two basic rules: digram uniqueness and rule reusability;

5) scanning the input symbol sequence once builds, from the bottom up, a hierarchical symbol tree framework that can be compared.

It should be noted that the above summary and the detailed description are intended to demonstrate the practical application of the technical solutions provided by the present invention, and should not be construed as limiting the scope of the present invention. Various modifications, equivalent substitutions, or improvements may be made by those skilled in the art within the spirit and principles of the invention. The scope of the invention is to be determined by the appended claims.

Claims

1. A method for calculating similarity based on symbol sequences, the method comprising the steps of:

2. The similarity calculation method according to claim 1, wherein the step 1 is specifically divided into the following three steps:

1) identifying an original input sequence in a database;

3) generating a grammar rule for the input sequence.

3. The similarity calculation method according to claim 1, wherein the step 2 is specifically divided into the following eleven steps:

1) reading an identifier in a database sequence;

10) stopping the iteration process after the work is finished;

4. The similarity calculation method according to claim 1, wherein the step 3 is specifically divided into the following three steps:

5. The similarity calculation method according to claim 1, wherein the step 4 is specifically divided into the following three steps:

2) defining a similarity formula: