CN104376055B - A kind of large-sized model data comparing method based on allocation methods - Google Patents
A kind of large-sized model data comparing method based on allocation methods
- Publication number
- CN104376055B (application number CN201410614042.4A)
- Authority
- CN
- China
- Prior art keywords
- burst
- record
- data
- records
- num
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a large-model data comparison method based on fragmentation, comprising the following steps: setting the fragmentation parameter; extracting all keys of the reference data source, sorting them in ascending order, and storing them in a key array; calculating the number of fragments fragment_num and the number of records in each fragment, then obtaining the first and last key values of each fragment in sequence from the key array; starting one worker thread for each fragment, each worker thread fetching the corresponding data content from the reference data source and from the data source to be compared; having each worker thread compare the data content assigned to it row by row and record the difference results; and, after all worker threads have finished, collecting the fragment_num difference results and merging them into the final difference result. Applied to two systems or two databases, the invention can substantially improve the efficiency of comparing large model data.
Description
Technical field
The present invention relates to a large-model data comparison method based on fragmentation, and belongs to the technical field of distribution network model management in power system automation.
Background art
The volume of distribution network model data is relatively large, and the number of records in a single model table may reach the order of millions. For tables of this magnitude, the traditional single-workflow comparison approach can make the comparison process very time-consuming.
Summary of the invention
In view of the deficiencies of the prior art, the object of the present invention is to provide a fragmentation-based large-model data comparison method that, applied to two systems or two databases, can substantially improve the efficiency of comparing large model data.
To achieve this object, the present invention adopts the following technical solution:
The fragmentation-based large-model data comparison method of the present invention comprises the following steps:
(1) Set the fragmentation parameter, which supports two setting modes: by record count and by data block size. If the parameter is set by data block size, let the data block size be m, let the length of each record in the data source to be compared be k, and let n be the maximum number of records contained in each fragment; then n = m/k. If the parameter is set by record count, n is the maximum number of records contained in each fragment.
(2) Extract all keys of the reference data source, sort them in ascending order, and store them in a key array. The size of the key array is the total number of records record_sum in the reference data source.
(3) Calculate the number of fragments fragment_num and the number of records in each fragment, then obtain the first and last key values of each fragment in sequence from the key array; this yields the fragment information.
fragment_num = record_sum / n + (record_sum % n != 0)
Here "/" denotes integer division, and the parenthesized comparison contributes 1 when it holds and 0 otherwise.
If the total number of records record_sum is an integer multiple of n, every fragment contains n records.
If the total number of records record_sum is not an integer multiple of n, each of the first fragment_num-1 fragments contains n records, and the remaining records are placed in the last fragment.
(4) Start one worker thread for each fragment. According to its fragment information, each worker thread fetches the corresponding data content from the reference data source and from the data source to be compared.
(5) Each worker thread compares the data content assigned to it row by row and field by field, and records the difference results.
(6) After all worker threads have finished, fragment_num difference results are obtained. All difference results are merged in ascending key order into a single result, which is the final difference result.
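Steps (1) to (3) can be illustrated with a minimal Python sketch. It assumes the sorted key array fits in memory as a list of integers; the function names records_per_fragment and plan_fragments are hypothetical and chosen here only for illustration, while n, m, k, record_sum and fragment_num follow the text.

```python
from typing import List, Tuple


def records_per_fragment(block_size: int, record_length: int) -> int:
    """Block-size mode: n = m / k, truncated to an integer."""
    n = block_size // record_length
    if n < 1:
        raise ValueError("the data block must hold at least one record")
    return n


def plan_fragments(sorted_keys: List[int], n: int) -> List[Tuple[int, int]]:
    """Return the (first key, last key) pair of every fragment."""
    record_sum = len(sorted_keys)
    # Integer division, plus 1 when a remainder exists (a bool adds as 0 or 1).
    fragment_num = record_sum // n + (record_sum % n != 0)
    bounds = []
    for i in range(fragment_num):
        chunk = sorted_keys[i * n:(i + 1) * n]  # the last chunk may be shorter
        bounds.append((chunk[0], chunk[-1]))
    return bounds
```

For ten keys 1..10 and n = 4, this gives three fragments: two full fragments of four records and a last fragment holding the remaining two.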
Each difference result comprises a difference flag and a difference content description. The difference flag takes one of three values: insert, update, or delete. If a record is absent from the data source to be compared but present in the reference data source, the flag is insert. If a record is present in the data source to be compared but absent from the reference data source, the flag is delete. If a record has the same key in the data source to be compared and in the reference data source but different content, the flag is update. The difference content description consists of the corresponding data records in the reference data source and in the data source to be compared.
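The three flags can be sketched as a small Python function that compares one fragment. This is an illustration under assumptions, not the patented implementation: both sides of the fragment are assumed to be already loaded as dictionaries mapping key to a record (itself a dictionary of field values), and the name diff_fragment and the tuple layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]
# (key, flag, record in reference source, record in source to be compared)
Diff = Tuple[int, str, Optional[Record], Optional[Record]]


def diff_fragment(reference: Dict[int, Record],
                  target: Dict[int, Record]) -> List[Diff]:
    """Classify every key of one fragment as insert, update, or delete."""
    diffs: List[Diff] = []
    for key in sorted(reference.keys() | target.keys()):
        ref_row = reference.get(key)
        tgt_row = target.get(key)
        if tgt_row is None:
            # Only the reference has the record: it must be inserted.
            diffs.append((key, "insert", ref_row, None))
        elif ref_row is None:
            # Only the source to be compared has the record: delete it.
            diffs.append((key, "delete", None, tgt_row))
        elif ref_row != tgt_row:
            # Same key, at least one field differs (dict equality checks all fields).
            diffs.append((key, "update", ref_row, tgt_row))
    return diffs
```

Records that are identical on both sides produce no entry, so the result lists only what must change in the data source to be compared.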
By offering two ways of setting the fragmentation parameter, by record count and by data block size, the invention keeps fragmentation flexible. Fragments are divided by key, which guarantees that they are disjoint and complete, and therefore that the comparison involves no redundant work and that the difference results are complete. Multiple worker threads read and compare data content simultaneously according to their own fragment information, so the comparison runs concurrently and overall comparison efficiency improves. The differences between records of the data source to be compared and records of the reference data source are recorded as a difference flag plus a difference content description, so the SQL statements needed for synchronization can easily be assembled from the difference results.
Brief description of the drawings
Fig. 1 is the workflow diagram of the fragmentation-based large-model data comparison method of the invention.
Detailed description of the embodiments
To make the technical means, inventive features, objects, and effects of the invention easy to understand, the invention is further described below with reference to a specific embodiment.
The present invention provides a large-model data comparison method based on fragmentation. Model tables in a distribution network system generally have a key, which makes fragment-by-fragment comparison by key possible. The invention mainly targets model data tables with many records: the fragmentation parameter is set as needed, fragment contents are fetched from the reference data source and from the data source to be compared according to that parameter, multiple fragments are compared simultaneously, and the difference results are finally obtained. A difference result consists of a difference flag and a difference content description. The difference flag is one of insert, update, or delete, and the difference content description is the content of the corresponding records in the reference data source and in the data source to be compared. The difference results are generated for the data source to be compared, with the reference data source as the baseline.
Referring to Fig. 1, the method comprises the following steps:
(1) Specify the data sources to compare and the model table to be compared. Supported data source types include databases and data files, and the model table must have a key. Set the fragmentation parameter as needed, either by record count or by data block size.
If the fragmentation parameter is set by data block size, assume the data block size is m and the length of each record in the table is k; the maximum number of records n contained in each fragment is then n = m/k. If the fragmentation parameter is set by record count, n is the configured value.
(2) Obtain all keys of the reference data source, sort them in ascending order, and store them in a key array. The size of the array is the total number of records record_sum in the reference data source.
Using the key array, the fragment information is derived from the fragmentation parameter. It comprises the number of fragments and the first and last key values of each fragment.
(3) Calculate the number of fragments and the number of records in each fragment, then obtain the first and last key values of each fragment in sequence from the key array.
The number of fragments fragment_num is:
fragment_num = record_sum / n + (record_sum % n != 0)
Here "/" denotes integer division, and the parenthesized comparison contributes 1 when it holds and 0 otherwise.
If the total number of records record_sum is an integer multiple of n, every fragment contains n records.
If the total number of records record_sum is not an integer multiple of n, each of the first fragment_num-1 fragments contains n records, and the remaining records are placed in the last fragment.
(4) Start one worker thread for each fragment. Each thread fetches the corresponding content of the reference data source and of the data source to be compared according to its fragment information.
(5) Each worker thread compares the data content assigned to it row by row and records the difference results. A difference result comprises a difference flag and a difference content description. The difference flag is one of insert, update, or delete, and the difference content description is the content of the corresponding records in the reference data source and in the data source to be compared.
(6) After all worker threads have finished comparing, fragment_num difference results are obtained. All difference results are merged into a single result, which is the final difference result.
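Steps (2) to (6) can be sketched end to end in Python using a thread pool with one worker per fragment. The sketch rests on several assumptions that are not in the text: both data sources are in-memory dictionaries mapping an integer key to a tuple of field values, the hypothetical fetch function stands in for a range query against a real database or file, and each fragment is widened to run from its own first key up to (but excluding) the next fragment's first key, with the first and last fragments left open-ended. That widening is added here so that keys present only in the data source to be compared, and lying between or outside the reference key ranges, are still reported.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional, Tuple

Row = Tuple[object, ...]
Diff = Tuple[int, str, Optional[Row], Optional[Row]]
Range = Tuple[Optional[int], Optional[int]]


def fetch(source: Dict[int, Row], lo: Optional[int],
          hi: Optional[int]) -> Dict[int, Row]:
    """Stand-in for a range query (lo inclusive, hi exclusive, None = open)."""
    return {k: v for k, v in source.items()
            if (lo is None or k >= lo) and (hi is None or k < hi)}


def compare_fragment(reference: Dict[int, Row], target: Dict[int, Row],
                     lo: Optional[int], hi: Optional[int]) -> List[Diff]:
    """Work done by one worker thread: fetch both sides, compare row by row."""
    ref, tgt = fetch(reference, lo, hi), fetch(target, lo, hi)
    diffs: List[Diff] = []
    for key in sorted(ref.keys() | tgt.keys()):
        if key not in tgt:
            diffs.append((key, "insert", ref[key], None))
        elif key not in ref:
            diffs.append((key, "delete", None, tgt[key]))
        elif ref[key] != tgt[key]:
            diffs.append((key, "update", ref[key], tgt[key]))
    return diffs


def compare_tables(reference: Dict[int, Row], target: Dict[int, Row],
                   n: int) -> List[Diff]:
    """Split by reference keys, compare fragments in parallel, merge in order."""
    starts = sorted(reference)[::n]  # first key of each fragment
    ranges: List[Range] = [
        (starts[i] if i > 0 else None,
         starts[i + 1] if i + 1 < len(starts) else None)
        for i in range(len(starts))
    ] or [(None, None)]  # an empty reference still needs one fragment
    with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
        # map() yields results in fragment order, so the merge is already
        # ascending by key.
        parts = pool.map(lambda r: compare_fragment(reference, target, *r), ranges)
        return [d for part in parts for d in part]
```

In CPython, threads mainly help when fetching is I/O-bound, as with database queries; for purely in-memory comparison the global interpreter lock limits the gain, so this sketch shows the structure rather than the speed-up.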
The operating principle of the invention is as follows:
The invention mainly targets large model data tables that share the same structure, contain many records, and reside in different systems or different databases; it compares them and obtains the difference results. Fragment information is derived from the keys of the reference data source, fragment contents are then fetched from the reference data source and from the data source to be compared, multiple fragments are compared simultaneously, and the difference results are finally obtained. The method implements a fragment-based comparison technique that greatly improves the efficiency of comparing large model data with many records.
By offering both setting by record count and setting by data block size, the invention guarantees a degree of flexibility in fragmentation. Fragments are divided by key, which guarantees that they are disjoint and complete, and therefore that the comparison involves no redundant work and that the difference results are complete. Multiple worker threads read and compare data content simultaneously according to their own fragment information, so the comparison runs concurrently and overall comparison efficiency improves. The differences between records of the data source to be compared and records of the reference data source are recorded as a difference flag plus a difference content description, so the SQL statements needed for synchronization can easily be assembled from the difference results.
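How synchronization SQL might be assembled from the difference results can be sketched as follows. The text does not specify the statement format, so everything concrete here is an assumption: the function name to_sql, the tuple layout (key, flag, reference record, record to be compared), records represented as dictionaries of column name to value, and "?" as the parameter placeholder (the style used by Python's sqlite3 module).

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]
Diff = Tuple[int, str, Optional[Record], Optional[Record]]
Statement = Tuple[str, tuple]  # (SQL text with placeholders, parameters)


def to_sql(table: str, key_col: str, diffs: List[Diff]) -> List[Statement]:
    """Turn difference results into statements that bring the data source
    to be compared in line with the reference data source."""
    statements: List[Statement] = []
    for key, flag, ref_row, _tgt_row in diffs:
        if flag == "insert":
            cols = list(ref_row)
            sql = "INSERT INTO {} ({}) VALUES ({})".format(
                table, ", ".join(cols), ", ".join("?" for _ in cols))
            statements.append((sql, tuple(ref_row[c] for c in cols)))
        elif flag == "update":
            cols = [c for c in ref_row if c != key_col]
            sql = "UPDATE {} SET {} WHERE {} = ?".format(
                table, ", ".join(c + " = ?" for c in cols), key_col)
            statements.append((sql, tuple(ref_row[c] for c in cols) + (key,)))
        elif flag == "delete":
            statements.append(
                ("DELETE FROM {} WHERE {} = ?".format(table, key_col), (key,)))
    return statements
```

Values travel as parameters rather than being spliced into the SQL text; the table and column names are interpolated directly and are assumed to come from trusted configuration.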
With the method of the invention, large model tables are compared fragment by fragment, which can greatly improve comparison efficiency. Leaving machine performance and resource usage aside, comparison after fragmentation is faster than before fragmentation by a factor approaching the number of fragments.
The basic principle, main features, and advantages of the invention have been shown and described above. Those skilled in the art should understand that the invention is not limited to the above embodiment; the embodiment and the description merely illustrate the principle of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The claimed scope of the invention is defined by the appended claims and their equivalents.
Claims (2)
1. A large-model data comparison method based on fragmentation, characterized in that it comprises the following steps:
(1) setting the fragmentation parameter, the fragmentation parameter supporting two setting modes: by record count and by data block size;
if the fragmentation parameter is set by data block size, letting the data block size be m, the length of each record in the data source to be compared be k, and the maximum number of records contained in each fragment be n, so that n = m/k;
if the fragmentation parameter is set by record count, n being the maximum number of records contained in each fragment;
the value of n in both setting modes being reduced to an integer by truncation;
(2) extracting all keys of the reference data source, sorting them in ascending order, and storing them in a key array, the size of the key array being the total number of records record_sum in the reference data source;
(3) calculating the number of fragments fragment_num and the number of records in each fragment, then obtaining the first and last key values of each fragment in sequence from the key array, thereby obtaining the fragment information;
if the total number of records record_sum is an integer multiple of n, each fragment containing n records, and the number of fragments being calculated as: fragment_num = record_sum/n;
if the total number of records record_sum is not an integer multiple of n, each of the first fragment_num-1 fragments containing n records, the remaining records being placed in the last fragment, and the number of fragments being calculated as: fragment_num = record_sum/n + 1, wherein the value of record_sum/n is reduced to an integer by truncation;
(4) starting one worker thread for each fragment, each worker thread fetching, according to its fragment information, the corresponding data content from the reference data source and from the data source to be compared;
(5) each worker thread comparing the data content assigned to it row by row and field by field, and recording the difference results;
(6) after all worker threads have finished, obtaining fragment_num difference results and merging all difference results in ascending key order into a single result, which is the final difference result.
2. The large-model data comparison method based on fragmentation according to claim 1, characterized in that the difference results comprise a difference flag and a difference content description;
the difference flag takes one of three values: insert, update, or delete; if a record is absent from the data source to be compared but present in the reference data source, the difference flag is insert; if a record is present in the data source to be compared but absent from the reference data source, the difference flag is delete; if a record has the same key in the data source to be compared and in the reference data source but different content, the difference flag is update;
the difference content description consists of the corresponding data records in the reference data source and in the data source to be compared.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410614042.4A CN104376055B (en) | 2014-11-04 | 2014-11-04 | A kind of large-sized model data comparing method based on allocation methods |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410614042.4A CN104376055B (en) | 2014-11-04 | 2014-11-04 | A kind of large-sized model data comparing method based on allocation methods |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104376055A CN104376055A (en) | 2015-02-25 |
CN104376055B true CN104376055B (en) | 2017-08-29 |
Family
ID=52554962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410614042.4A Active CN104376055B (en) | 2014-11-04 | 2014-11-04 | A kind of large-sized model data comparing method based on allocation methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104376055B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106033427A (en) * | 2015-03-11 | 2016-10-19 | 阿里巴巴集团控股有限公司 | A sampling data verification method and device |
CN105843886A (en) * | 2016-03-21 | 2016-08-10 | 国电南瑞科技股份有限公司 | Multi-thread based power grid offline model data query method |
CN106777337A (en) * | 2017-01-13 | 2017-05-31 | 山东浪潮商用系统有限公司 | The management method of data model |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1652116A (en) * | 2005-03-29 | 2005-08-10 | 威盛电子股份有限公司 | Database synchronous system and method |
CN101236554A (en) * | 2007-11-29 | 2008-08-06 | 中兴通讯股份有限公司 | Database mass data comparison process |
CN102467570A (en) * | 2010-11-17 | 2012-05-23 | 日电(中国)有限公司 | Connection query system and method for distributed data warehouse |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1708096A1 (en) * | 2005-03-31 | 2006-10-04 | Ubs Ag | Computer Network System and Method for the Synchronisation of a Second Database with a First Database |
-
2014
- 2014-11-04 CN CN201410614042.4A patent/CN104376055B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN104376055A (en) | 2015-02-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101329676B (en) | Data paralleling abstracting method and apparatus and database system | |
CN103810224B (en) | information persistence and query method and device | |
US9195701B2 (en) | System and method for flexible distributed massively parallel processing (MPP) database | |
CN102270225A (en) | Data change log monitoring method and device | |
CN110209728A (en) | A kind of Distributed Heterogeneous Database synchronous method, electronic equipment and storage medium | |
CN104376055B (en) | A kind of large-sized model data comparing method based on allocation methods | |
CN104268298A (en) | Method for creating database index and inquiring data | |
WO2019228015A1 (en) | Index creating method and apparatus based on nosql database of mobile terminal | |
CN103226610B (en) | Database table querying method and device | |
CN104715076B (en) | A kind of data processing of multithread and device | |
CN103780263B (en) | Device and method of data compression and recording medium | |
US9262472B2 (en) | Concatenation for relations | |
CN103365923A (en) | Method and device for assessing partition schemes of database | |
CN106897281A (en) | A kind of daily record sharding method and device | |
CN104298570B (en) | Data processing method and device | |
CN106682047A (en) | Method for importing data and related device | |
JP2017537383A5 (en) | ||
CN104572730A (en) | Method and device for importing and exporting digital resources | |
CN106156197A (en) | The querying method of a kind of data base and device | |
CN106776810A (en) | The data handling system and method for a kind of big data | |
US9135300B1 (en) | Efficient sampling with replacement | |
US9378229B1 (en) | Index selection based on a compressed workload | |
CN104461552B (en) | The analytic method and resolver of bar code attribute | |
KR20160047239A (en) | The column group selection method for storing datea efficiently in the mixed olap/oltp workload environment | |
CN105786916B (en) | A kind of storage method and system of the gradation directory based on large capacity table |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |