
CN105512264B - Performance prediction method for concurrent workloads in a distributed database - Google Patents

Performance prediction method for concurrent workloads in a distributed database

Info

Publication number
CN105512264B
CN105512264B (application CN201510881758.5A)
Authority
CN
China
Prior art keywords
query
regression model
database
network transmission
distributed data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510881758.5A
Other languages
Chinese (zh)
Other versions
CN105512264A (en)
Inventor
李晖
陈梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou Youlian Borui Technology Co Ltd
Guizhou University
Original Assignee
Guizhou Youlian Borui Technology Co Ltd
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou Youlian Borui Technology Co Ltd, Guizhou University filed Critical Guizhou Youlian Borui Technology Co Ltd
Priority to CN201510881758.5A priority Critical patent/CN105512264B/en
Publication of CN105512264A publication Critical patent/CN105512264A/en
Application granted granted Critical
Publication of CN105512264B publication Critical patent/CN105512264B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a performance prediction method for concurrent workloads in a distributed database. A linear regression model is established to judge the interactions between queries in the distributed database and to predict the query latency L of the distributed database system under different degrees of concurrency; the database then performs selective task allocation according to the query latency L. The main steps are: A. selecting the metrics for the query latency L; B. establishing the linear regression model for the interactions among concurrently executing query combinations; C. experimentally verifying the correctness and validity of the linear regression model. Repeated experiments show that the overall average relative errors of the query latency, network latency and I/O block read count are 14%, 30% and 37% respectively. The experimental results show that the proposed linear regression model can accurately predict the performance of concurrent workloads in a distributed database, thereby facilitating subsequent task allocation by the database and shortening the average query latency.

Description

Performance prediction method for concurrent workloads in a distributed database
Technical field
The present invention relates to a performance prediction method for database workloads, and in particular to a performance prediction method for concurrent workloads in a distributed database.
Background technique
At present, there is already related research on performance prediction for database workloads, but it is limited to single-node databases, i.e. databases with only one server, whose performance depends mainly on the utilization of that server's disk and CPU. With the growth of data volumes generated by research and industry, distributed database systems are used to store and manage petabyte-scale data while providing high concurrency and scalability. Data in a distributed database are processed in a scatter/gather pattern. For example, a query can be split by one node into multiple subqueries, these subqueries are executed concurrently by many other nodes, and the partial results of each node are then returned to the originating node and combined to obtain the final query result. Thus, in a distributed database, data are stored partitioned across multiple distributed nodes in the cluster, and the cluster can be extended easily by adding new nodes; this is one reason why distributed databases are used to store and process big data. In general, a distributed database supports the concurrent execution of analytical workloads in order to reduce the required query execution time. However, besides its many advantages, concurrent execution also brings challenges in resource contention, for example judging the interactions among multiple queries. The interactions among multiple queries may differ: when two queries share one table scan, the effect between them may be positive; conversely, when two queries both demand high network transmission bandwidth, they will mutually increase each other's query execution time because of network latency. For a single-node database, task allocation is limited to one server, but for a multi-node distributed database
there are many choices for task allocation. How to allocate tasks so that the average query latency is shorter is what the database should consider when performing task allocation. For example, suppose a distributed database has 3 servers, all executing query tasks; server 1 has relatively low disk and CPU utilization, while servers 2 and 3 have relatively high disk and CPU utilization. If a new query arrives at this moment, one must consider which server to assign it to. If server 1 still has other queries waiting, or its disk and CPU utilization will rise at the next moment, while the disk and CPU utilization of servers 2 and 3 will drop at the next moment, then the task should be assigned to server 2 or 3 rather than server 1. It is therefore necessary to predict the performance of workloads in a distributed database so as to facilitate subsequent task allocation. Owing to the particularity of distributed databases, previous database performance prediction techniques are not applicable to today's distributed databases, and among existing performance prediction methods none performs performance prediction for concurrent workloads.
Summary of the invention
The object of the present invention is to provide a performance prediction method for concurrent workloads in a distributed database. The invention can accurately predict the performance of concurrent workloads in a distributed database, thereby facilitating subsequent task allocation and shortening the average query latency.
Technical solution of the present invention: a performance prediction method for concurrent workloads in a distributed database, which establishes a multiple linear regression model to judge the interactions between queries in the distributed database and to predict the query latency L of the distributed database system under different degrees of concurrency; the database performs selective task allocation according to the query latency L. Its main steps include:
A. selecting the metrics for the query latency L;
B. establishing the multiple linear regression model for the interactions among concurrently executing query combinations;
C. experimentally verifying the correctness and validity of the multiple linear regression model.
In the aforementioned performance prediction method for concurrent workloads in a distributed database, the query latency L in step A comprises network latency and local processing.
In the aforementioned method, the network latency uses the network transmission volume N as its metric, and the local processing uses the I/O block read count Y as its metric.
In the aforementioned method, step B consists of the following parts:
B1: predicting query interactions;
B2: predicting query latency;
B3: training the linear regression model based on sampling.
In the aforementioned method, step B1 comprises: predicting the I/O block read count Y and the network transmission volume N of a primary query q when it executes concurrently with secondary queries p1...pn; the I/O block read count Y is predicted by the following linear regression model:
Bq = β1·Bq + β2·Σi Bpi + β3·Σi ΔBq/pi + β4·Σi≠j ΔBpi/pj  (1)
The network transmission volume N is predicted by the following linear regression model:
Nq = β1·Nq + β2·Σi Npi + β3·Σi ΔNq/pi + β4·Σi≠j ΔNpi/pj  (2)
Step B2 predicts the query latency L by the following linear regression model:
Lq = Cq + β1·Bq + β2·Nq  (3);
Step B3 is: given 2 or more queries, a stratified sampling function is used to generate different query combinations; the different query combinations are run pairwise, and the I/O block read count Y and the network transmission volume N of each query combination are recorded to form samples; the coefficients β1, β2, β3 and β4 of the linear regression model are then estimated from the samples by the least squares method.
In the formulas, Bq is the I/O block read count of the primary query q in isolation;
Σi Bpi is the sum of the I/O block read counts of all secondary queries;
Σi ΔBq/pi is the sum of the direct-effect values, in I/O block read counts, of all secondary queries on the primary query;
Σi≠j ΔBpi/pj is the sum of the indirect-effect values, in I/O block read counts, among the secondary queries;
Nq is the network transmission volume of the primary query q in isolation;
Σi Npi is the sum of the network transmission volumes of all secondary queries;
Σi ΔNq/pi is the sum of the direct-effect values, in network transmission volume, of all secondary queries on the primary query;
Σi≠j ΔNpi/pj is the sum of the indirect-effect values, in network transmission volume, among the secondary queries;
Cq is the CPU overhead time of query q.
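The three regression formulas above can be read as plain arithmetic over per-query aggregates. The sketch below is an illustrative editorial reading of formulas (1)-(3), not part of the original disclosure; all coefficient and input values are hypothetical.

```python
# Sketch of formulas (1)-(3): predict a primary query's I/O block reads B,
# network transmission N, and latency L from its isolation value, the
# concurrency (sum over secondary queries), direct-effect and
# indirect-effect sums. Coefficients come from a least-squares fit.

def predict_metric(beta, iso, concurrent_sum, direct_sum, indirect_sum):
    """Formula (1)/(2): same linear form for either B or N."""
    b1, b2, b3, b4 = beta
    return b1 * iso + b2 * concurrent_sum + b3 * direct_sum + b4 * indirect_sum

def predict_latency(c_q, beta_l, b_q, n_q):
    """Formula (3): Lq = Cq + beta1*Bq + beta2*Nq."""
    b1, b2 = beta_l
    return c_q + b1 * b_q + b2 * n_q

# Hypothetical coefficients and measurements for one primary query q
# running concurrently with two secondary queries.
beta_b = (1.0, 0.02, 0.5, 0.1)  # fitted coefficients for I/O block reads
b_q = predict_metric(beta_b, iso=1000.0, concurrent_sum=2400.0,
                     direct_sum=180.0, indirect_sum=60.0)
l_q = predict_latency(c_q=0.5, beta_l=(0.001, 0.002), b_q=b_q, n_q=800.0)
print(b_q, round(l_q, 3))  # -> 1144.0 3.244
```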
In the aforementioned method, step C is: run queries Q1, Q2, Q3 ... Qn to obtain measured values; feed the measured values into the multiple linear regression model to output predicted values; sample one part of the data as the test data set and the other part as the training data set; and observe the goodness of fit between the predicted and measured values.
In the aforementioned method, the network transmission volume N uses the number of network packets transmitted between nodes during query execution as its raw data.
In the aforementioned method, the network packet count and the I/O block read count Y are obtained using SystemTap.
Beneficial effects of the present invention: compared with the prior art, because data are transmitted between nodes in a distributed database, executing a query in such a system involves network overhead; the invention therefore takes network latency into account when predicting concurrent query execution performance, and proposes a linear regression model to predict the interactions of concurrently executing analytical workloads in a distributed database system. Since network latency and local processing are the two most important factors in query execution time, the invention uses the linear regression model to analyze query execution behavior from three aspects: network latency, local processing and different degrees of concurrency. In addition, sampling techniques are used to obtain query combinations under different degrees of concurrency. The model of the invention was evaluated on a cluster built with PostgreSQL, using the typical analytical workload of the TPC-H data set. Repeated experiments show that the overall average relative errors of the query latency, network latency and I/O block read count Y are 14%, 30% and 37% respectively. The experimental results show that the proposed linear regression model can accurately predict the performance of concurrent workloads in a distributed database, thereby facilitating subsequent task allocation by the database and shortening the average query latency.
Brief description of the drawings
Fig. 1 is a fitting diagram of the predicted and measured values of the query latency of the invention;
Fig. 2 is a fitting diagram of the predicted and measured values of the I/O block read count Y of the invention;
Fig. 3 is a fitting diagram of the predicted and measured values of the network latency of the invention;
Fig. 4 is a diagram of the average relative errors of the invention when the degree of concurrency is 3;
Fig. 5 is a diagram of the average relative errors of the invention when the degree of concurrency is 4.
Specific embodiment
The present invention is further illustrated below with reference to the accompanying drawings and embodiments, but this is not to be taken as a basis for limiting the invention.
The embodiment of the present invention:
1. Performance prediction
The object of the present invention is to study the prediction of concurrent query latency in a distributed database system. The performance of a distributed database system is influenced mainly by contention for shared basic resources, such as RAM, CPU, disk I/O and network bandwidth. The invention therefore first selects metrics usable for predicting query latency under concurrent workloads, especially for distributed database systems.
The present invention focuses on predicting the concurrent query latency of distributed analytical workloads. Analytical queries in a distributed database system mainly involve two aspects: network latency and local processing.
Local processing is the retrieval and processing, on a node, of the data a query needs. The local processing time is the average time between submitting a request for the needed data blocks on a node and the request returning. For a logical I/O request, local processing may require many disk seeks, a series of sequential reads with a few writes, or accesses to the cache and buffer pool. In general, most of the local processing time is spent on I/O operations, and read operations far outnumber write operations. The invention therefore uses the average I/O block read count Y as the metric for query latency prediction in the local processing dimension.
Since data in a distributed database are processed in a scatter/gather pattern, network transmission is necessary when executing queries. Data are partitioned and stored on multiple distributed nodes in the cluster, and the transmitted data may be partial query results obtained from local nodes, or the final result returned to the node that submitted the request. The network transmission volume N is a factor influencing query latency in a distributed database system, so the invention uses it as the metric in the network latency dimension.
The present invention is mainly directed at the study of moderately complex analytical queries. For this purpose, 10 moderately complex query statements were chosen from TPC-H and combined to form the query analysis of the invention; these query statements focus on the concurrent execution performance of a distributed database system. First, the 10 query statements were run under different degrees of concurrency on a 10 GB data set generated with TPC-H, on a PostgreSQL cluster consisting of 4 nodes, to obtain measured query latencies, where MPL denotes the number of concurrently executing queries, i.e. the degree of concurrency. As Table 1 shows, not every query statement's latency exhibits a linearly increasing trend as the degree of concurrency increases.
Table 1. Average query latency of the 10 queries under different degrees of concurrency
Query MPL1 MPL2 MPL3 MPL4
3 0.07 0.13 0.12 0.10
4 5.23 5.48 5.32 5.61
5 8.92 9.62 9.70 10.46
6 2.63 3.14 2.76 2.80
7 27.80 29.48 31.03 32.06
8 26.95 28.24 31.85 28.12
10 3.13 3.68 3.61 3.71
14 3.50 4.10 3.84 4.11
18 83.14 93.47 87.93 86.03
19 4.83 5.90 5.92 6.19
2. Interaction modeling
As discussed in the previous section, the invention uses the I/O block read count Y and the network transmission volume N as the metrics in the two dimensions of local processing and network latency, respectively, to predict the performance of different query combinations under different degrees of concurrency. In this section, the invention proposes two multiple linear regression models to study the interactions of query combinations under concurrent execution. The invention then proposes a further linear regression model that uses the I/O block read count Y and the network latency to predict query latency. Finally, the sampled data set is used for training to obtain the prediction model of the invention.
To predict the mutual influence of queries under concurrent execution, the invention first judges the influence on the I/O block read count Y and the network transmission volume N when two queries execute concurrently, and then gradually increases the concurrency. In particular, the invention constructs a multiple linear regression model to analyze the mutual influence at a degree of concurrency of two. To make the model easier to understand, queries are divided into a primary query and secondary queries: the primary query is the query whose behavior under concurrency the invention wants to study, and a secondary query is a query executing concurrently with the primary query. Before introducing the proposed model, the relevant variables are introduced; their values can be obtained from the training data set.
Isolation value: the invention proposes this variable as a base value, i.e. the value observed when the primary query executes without concurrency. The invention uses it as a baseline for judging the corresponding value under concurrency. For example, for query i, Bi denotes its I/O block read count and Ni its network transmission volume.
Concurrency value: likewise, the value of this variable is the sum of the isolation values of the concurrent queries, e.g. the Bi or Ni in the example above.
Direct-effect value: the invention uses this value to denote the influence of a secondary query on the primary query; it is the sum of the changes in the metric. For example, when i is the primary query and j a secondary query, then for the network transmission volume N, Ni/j denotes the directly affected metric, and the change value is ΔNi/j = Ni/j − Ni.
Indirect-effect value: the invention uses this variable to denote the direct mutual influence among the secondary queries; its value is the sum of the direct-effect values of the secondary queries.
Therefore, the invention predicts the average I/O block read count Y and the network transmission volume N of query q when it executes concurrently with p1...pn using the following formulas:
Bq = β1·Bq + β2·Σi Bpi + β3·Σi ΔBq/pi + β4·Σi≠j ΔBpi/pj  (1)
Nq = β1·Nq + β2·Σi Npi + β3·Σi ΔNq/pi + β4·Σi≠j ΔNpi/pj  (2)
The invention estimates the coefficients β1, β2, β3, β4 of each query using the least squares method.
These coefficients are obtained by training on the training data set.
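The coefficient estimation can be sketched with an ordinary least-squares fit. This is an editorial illustration on synthetic data, not the patent's Matlab procedure; the feature values and "true" coefficients are invented for the demonstration.

```python
import numpy as np

# Least-squares estimation of beta1..beta4 for formula (1), as a sketch.
# Each training sample maps (isolation value, concurrency value,
# direct-effect sum, indirect-effect sum) -> measured I/O block reads.
rng = np.random.default_rng(0)
X = rng.uniform(10.0, 1000.0, size=(40, 4))   # 40 synthetic samples
true_beta = np.array([1.1, 0.05, 0.4, 0.08])  # hypothetical coefficients
y = X @ true_beta                              # noiseless "measurements"

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
print(np.allclose(beta, true_beta))            # -> True
```

With real, noisy measurements the recovered coefficients would only approximate the underlying ones; the noiseless setup here just confirms the fitting mechanics.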
The invention considers the I/O block read count Y and the network transmission volume N simultaneously to establish a linear regression model that predicts the latency of each query. For a distributed database system, query latency consists mainly of network latency and local processing, and the local processing time mainly comprises the specific CPU overhead time and the average logical I/O wait time; the latency of query q can therefore be predicted by the following formula:
Lq = Cq + β1·Bq + β2·Nq  (3)
In the above formulas 1, 2 and 3, Cq denotes the specific CPU overhead time of query q, Bq denotes the average I/O block read count, and Nq denotes the average network transmission volume between the multiple nodes of the distributed database.
The invention obtains the coefficients β1, β2 by applying the least squares method to the samples in repeated experiments.
To make the proposed model easier to understand, a simple example is introduced next. Suppose the invention wants to predict the latency of query a when it executes concurrently with queries b and c in a distributed database system. The following values must first be computed:
For queries a, b and c: the isolation I/O block read counts Ba, Bb and Bc, and the isolation network transmission volumes Na, Nb and Nc.
Under concurrent execution with queries b and c, the direct-effect values of query a: ΔBa/b, ΔNa/b, ΔBa/c, ΔNa/c.
Indirect-effect values: ΔBc/b, ΔNc/b, ΔBb/c, ΔNb/c.
The invention obtains the corresponding metrics by the following two formulas:
Ba = β1·Ba + β2(Bb+Bc) + β3(ΔBa/b+ΔBa/c) + β4(ΔBc/b+ΔBb/c)
Na = β1·Na + β2(Nb+Nc) + β3(ΔNa/b+ΔNa/c) + β4(ΔNc/b+ΔNb/c)
Next, formula 3 is used to predict the latency of query a:
La = Ca + β1·Ba + β2·Na
To obtain the query latency from formula 3, the prediction model of the invention needs to be trained. First, the features of each of the 10 queries when run in isolation are obtained, i.e. the query execution latency, the I/O block read count Y and the network latency; these 10 queries are the baseline query statements combined into various queries at different MPLs. By running queries pairwise, e.g. 55 query pairs, the invention can obtain the specific features of how they influence each other in pairs.
To run these query statements while obtaining a higher degree of interaction across multiple machines, the invention uses LHS to generate the different query combinations representing the required workload. LHS is a stratified sampling function that conveniently generates sample data. Table 2 shows an LHS example at MPL 2; in this example 5 query pairs are generated. In the experiments, the I/O block read count and the network transmission volume N of each query combination are recorded to form samples, and these samples are used to estimate the coefficients of the model. For each query, many query example combinations are generated to form samples. For example, query 3 denotes the combination Q3, Q4, Q5, where Q3 is the primary query and Q4, Q5 are secondary queries.
Table 2. 2-dimensional LHS example
Query 1 2 3 4 5
1 X
2 X
3 X
4 X
5 X
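The 2-D LHS pattern of Table 2 (exactly one X per row and per column) can be sketched with a tiny sampler. This is an illustrative editorial sketch of the sampling idea, not the LHS implementation used by the authors; the function name and seed are invented.

```python
import random

def latin_hypercube_pairs(n, seed=42):
    """Pick n (primary, secondary) query pairs so that each primary index
    and each secondary index is used exactly once, matching the 2-D LHS
    layout of Table 2 (one X per row and one X per column)."""
    rng = random.Random(seed)
    secondary = list(range(1, n + 1))
    rng.shuffle(secondary)               # one random permutation = one LHS
    return list(zip(range(1, n + 1), secondary))

pairs = latin_hypercube_pairs(5)
print(pairs)  # 5 pairs; every index 1..5 appears once on each side
```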
For the established linear regression models, the correctness and validity of the three proposed models need to be assessed. The experiments examine the measured and predicted values of the query latency, the network transmission volume N and the I/O block read count Y, as well as the average relative error of each query when the degree of concurrency is 3 and 4.
3. Experimental validation
To assess the feasibility of the method and the accuracy of the models, queries are executed on the 10 GB data set generated by the QGEN tool provided with TPC-H. Since this research focuses on analytical workloads, the invention selects Q3, Q4, Q5, Q6, Q7, Q8, Q10, Q14, Q18 and Q19 from the 22 TPC-H queries to form its query combinations. These queries were chosen because their execution times are relatively long, which provides more time and facilitates collecting the I/O block read count Y and the network transmission volume N. The distributed database system of this experiment is a database cluster consisting of four PostgreSQL nodes, implemented with Postgres-XL, an open-source PostgreSQL database cluster with a high degree of scalability and flexibility for different database workloads. The cluster is deployed on physical machines with a quad-core 2 GHz Intel(R) Xeon(R) CPU E5-2620 processor and 8 GB of memory; each node runs CentOS 6.4 with Linux kernel 2.6.32.
First, the training data set is obtained by the sampling technique, and the multiple linear regression model is fitted using Matlab; the test data set is then used to predict the I/O block read count Y and the network latency of queries under concurrent execution.
The training and test data sets are obtained as follows: run queries Q1, Q2, Q3 ... Qn to obtain measured values; feed the measured values into the multiple linear regression model to output predicted values; sample one part of the data as the test data set and the other part as the training data set; and observe the goodness of fit between the predicted and measured values.
The goodness of fit between predicted and measured values is presented in Fig. 1. In the experiments, the invention uses the coefficient of determination R2 to measure how well the regression models fit. R2 ranges from 0 to 1; the closer its value is to 1, the closer the predicted values are to the measured values and the better the regression model. Figs. 1, 2 and 3 show, under multiple degrees of concurrency, the goodness of fit between the predicted and measured values of the query latency, network latency and I/O block read count Y obtained with the proposed prediction models. The respective R2 values are 0.94, 0.58 and 0.84, which illustrates the ability of this work to predict query latency from the network latency and the I/O block read count Y. For each query, the models in formulas 1 and 2 are first used to predict the network latency and the I/O block read count Y, and formula 3 is then used to predict the query latency. In the experiments, the invention uses the number of network packets transmitted between nodes as the raw data for measuring the network transmission volume N during query execution.
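The coefficient of determination used to judge the fit can be computed directly from its definition. A minimal sketch follows; the measured/predicted values are hypothetical, not the experiment's data.

```python
def r_squared(measured, predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean = sum(measured) / len(measured)
    ss_tot = sum((m - mean) ** 2 for m in measured)   # total variance
    ss_res = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return 1.0 - ss_res / ss_tot

# Hypothetical latency measurements vs. model predictions.
measured = [5.2, 8.9, 2.6, 27.8, 3.1]
predicted = [5.0, 9.6, 2.8, 29.5, 3.7]
print(round(r_squared(measured, predicted), 3))  # -> 0.991
```

A value near 1, as here, indicates the predictions track the measurements closely, which is how the 0.94/0.58/0.84 figures above should be read.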
It is worth noting how the raw data are obtained, because these raw data are later processed into samples, and the two key factors affecting the quality of a linear regression model are the quality and the quantity of the samples; the method of obtaining the raw data is therefore very important.
To collect the I/O block read count Y and the network transmission volume N during each query execution, this research uses SystemTap to execute hand-written scripts that collect data dynamically. SystemTap is a dynamic method for monitoring and tracing the operation of a running Linux kernel, providing the user with a simple command-line interface and scripting language. Compared with capturing PostgreSQL's own statistics or using other tools to obtain the network latency, SystemTap yields more accurate network packet and I/O block read counts Y. In addition, to obtain more data and to make the collected data more accurate, the invention also tuned the shared_buffer value of PostgreSQL appropriately.
As mentioned above, the invention obtains the model coefficients using ordinary least squares (OLS). Empirically, OLS needs at least 6 samples for query latency prediction to satisfy the basic requirements; the experiments used 120 sample values. For predicting the network latency and the I/O block read count Y, at least 13 samples are needed; 140 samples were used in this research. It was also found that increasing the sample size did not noticeably change the overall trend, but only made the points somewhat denser, so no additional points were plotted in Fig. 1; points with particularly large values are also not shown, e.g. the execution time of query 18 is not reflected in Fig. 1. In addition, for the network latency prediction in Fig. 3, some predictions are noticeably higher or lower; this is because fluctuations of the experimental network, or packet loss while collecting data, enlarge the error between predicted and observed values.
In addition, to stay closer to practical application scenarios, the cache was not cleared before each query execution in the experiments; this is one reason why the prediction accuracy decreases slightly as the degree of concurrency increases. This phenomenon can be observed by comparing the average relative errors of the I/O block read count Y in Fig. 4 and Fig. 5.
Comparing Fig. 4 and Fig. 5 also shows that the average relative error of query 3 (Q3) at a concurrency of 3 is higher than at a concurrency of 4. Analysis shows that this is because the execution time of query 3 (Q3) is too short to obtain accurate source data; the low sample quality makes the prediction error higher.
Fig. 4 and Fig. 5 show the average relative errors of the query latency, the network latency, and the I/O block read count Y at degrees of concurrency 3 and 4, respectively, where the average relative error is computed as |(measured value − predicted value)/measured value|. The overall average relative errors of the query latency, the network latency, and the I/O block read count Y are 14%, 30%, and 37%, respectively. These experimental results show that the model proposed by the present invention can effectively predict the performance of concurrent workloads in a distributed database system.
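As a hedged illustration (not part of the patent text), the average relative error metric above can be computed as follows; the measured and predicted values are invented for the example.

```python
# Sketch: average relative error |(measured - predicted) / measured|,
# the metric used to evaluate the models. Values are illustrative only.
def avg_relative_error(measured, predicted):
    errs = [abs((m - p) / m) for m, p in zip(measured, predicted)]
    return sum(errs) / len(errs)

measured = [10.0, 20.0, 40.0]
predicted = [11.0, 18.0, 44.0]
print(avg_relative_error(measured, predicted))  # each term is 0.1, so about 0.1
```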

Claims (4)

1. A performance prediction method for concurrent workloads in a distributed database, characterized in that: a multiple linear regression model is established to characterize the interactions between queries in the distributed database and to predict the query latency L of the distributed database under different degrees of concurrency, and the database distributes tasks selectively according to the query latency L; the main steps comprise:
A. selecting the metrics for the query latency L;
B. establishing the multiple linear regression model for the interactions under concurrent query combinations;
C. experimentally verifying the correctness and validity of the multiple linear regression model;
The query latency L in step A comprises network latency and local processing;
The network latency uses the network transmission volume N as its metric; the local processing uses the I/O block read count Y as its metric;
The step B consists of the following parts:
B1: predicting query interactions;
B2: predicting query latency;
B3: training the linear regression model based on sampling;
The step B1 comprises: predicting the I/O block read count Y and the network transmission volume N of a primary query q when it is executed concurrently with secondary queries p1...pn; the I/O block read count Y is predicted by the following linear regression model:
B_q = β1 + β2·Σi B_pi + β3·Σi D_pi + β4·Σij I_pipj (1)
The network transmission volume N is predicted by the following linear regression model:
N_q = β1 + β2·Σi N_pi + β3·Σi D'_pi + β4·Σij I'_pipj (2)
The step B2 predicts the query latency L by the following linear regression model:
L = C_q + β1·B_q + β2·N_q (3)
The step B3 is: two or more queries are provided, different query combinations are generated with a stratified sampling function, and the different combinations are run in pairs; the I/O block read count Y and the network transmission volume N of each query combination are recorded to form the samples, from which the coefficients β1, β2, β3 and β4 of each linear regression model are estimated by the least squares method;
In the formulas, B_q is the I/O block read count of the primary query q;
Σi B_pi is the sum of the I/O block read counts of all secondary queries;
Σi D_pi is the sum of the direct-influence values of all secondary queries on the I/O block reads of the primary query;
Σij I_pipj is the sum of the indirect-influence values of I/O block reads between the secondary queries;
N_q is the network transmission volume of the primary query q;
Σi N_pi is the sum of the network transmission volumes of all secondary queries;
Σi D'_pi is the sum of the direct-influence values of all secondary queries on the network transmission of the primary query;
Σij I'_pipj is the sum of the indirect-influence values of network transmission between the secondary queries;
C_q is the CPU overhead time of query q.
2. The performance prediction method for concurrent workloads in a distributed database according to claim 1, characterized in that the step C is: running queries Q1, Q2, Q3, ..., Qn to obtain measured values, then feeding the measured values into the multiple linear regression model to obtain predicted values; one part of the samples is taken as the test data set and the other part as the training data set, and the goodness of fit between the predicted values and the measured values is observed.
3. The performance prediction method for concurrent workloads in a distributed database according to claim 1, characterized in that the network transmission volume N uses the number of network transmission packets between nodes during query execution as its raw measurement data.
4. The performance prediction method for concurrent workloads in a distributed database according to claim 3, characterized in that the network transmission packet count and the I/O block read count Y are obtained using SystemTap.
CN201510881758.5A 2015-12-04 2015-12-04 The performance prediction method that concurrent efforts load in distributed data base Active CN105512264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510881758.5A CN105512264B (en) 2015-12-04 2015-12-04 The performance prediction method that concurrent efforts load in distributed data base

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510881758.5A CN105512264B (en) 2015-12-04 2015-12-04 The performance prediction method that concurrent efforts load in distributed data base

Publications (2)

Publication Number Publication Date
CN105512264A CN105512264A (en) 2016-04-20
CN105512264B true CN105512264B (en) 2019-04-19

Family

ID=55720246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510881758.5A Active CN105512264B (en) 2015-12-04 2015-12-04 The performance prediction method that concurrent efforts load in distributed data base

Country Status (1)

Country Link
CN (1) CN105512264B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451041B (en) * 2017-07-24 2019-11-22 华中科技大学 A kind of object cloud storage system response delay prediction technique
CN107679243A (en) * 2017-10-31 2018-02-09 麦格创科技(深圳)有限公司 Task distributes the application process and system in distributed system
CN109308193B (en) * 2018-09-06 2019-08-09 广州市品高软件股份有限公司 A kind of multi-tenant function calculates the concurrency control method of service
CN110210000A (en) * 2019-04-18 2019-09-06 贵州大学 The identification of industrial process efficiency and diagnostic method based on Multiple Non Linear Regression
CN111782396B (en) * 2020-07-01 2022-12-23 浪潮云信息技术股份公司 Concurrency elastic control method based on distributed database
CN112307042B (en) * 2020-11-01 2024-09-17 宋清卿 Database load analysis method for query intensive data storage processing system
US11568320B2 (en) * 2021-01-21 2023-01-31 Snowflake Inc. Handling system-characteristics drift in machine learning applications
CN113157814B (en) * 2021-01-29 2023-07-18 东北大学 Query-driven intelligent workload analysis method under relational database
CN113485638B (en) * 2021-06-07 2022-11-11 贵州大学 Access optimization system for massive astronomical data
CN113296964B (en) * 2021-07-28 2022-01-04 阿里云计算有限公司 Data processing method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609416B (en) * 2009-07-13 2012-11-14 清华大学 Method for improving performance tuning speed of distributed system
CN101841565B (en) * 2010-04-20 2013-07-31 中国科学院软件研究所 Database cluster system load balancing method and database cluster system
CN104794186B (en) * 2015-04-13 2017-10-27 太原理工大学 The acquisition method of database loads response time forecast model training sample

Also Published As

Publication number Publication date
CN105512264A (en) 2016-04-20

Similar Documents

Publication Publication Date Title
CN105512264B (en) The performance prediction method that concurrent efforts load in distributed data base
Li et al. A platform for scalable one-pass analytics using mapreduce
US11429584B2 (en) Automatic determination of table distribution for multinode, distributed database systems
US10621064B2 (en) Proactive impact measurement of database changes on production systems
US8190598B2 (en) Skew-based costing for database queries
CN107908536B (en) Performance evaluation method and system for GPU application in CPU-GPU heterogeneous environment
CN105512465B (en) Based on the cloud platform safety quantitative estimation method for improving VIKOR methods
CN110377519B (en) Performance capacity test method, device and equipment of big data system and storage medium
US10411969B2 (en) Backend resource costs for online service offerings
Ortiz et al. A vision for personalized service level agreements in the cloud
CN108733781A (en) The cluster temporal data indexing means calculated based on memory
CN113157541B (en) Multi-concurrency OLAP type query performance prediction method and system for distributed database
Vrbić Data mining and cloud computing
Awada et al. Cost Estimation Across Heterogeneous SQL-Based Big Data Infrastructures in Teradata IntelliSphere.
Molka et al. Experiments or simulation? A characterization of evaluation methods for in-memory databases
Kusuma et al. Performance comparison of caching strategy on wordpress multisite
Öztürk Tuning parameters of Apache Spark with Gauss–Pareto-based multi-objective optimization
Tabet et al. Towards a new data replication strategy in mongodb systems
Barajas et al. Benchmarking parallel k-means cloud type clustering from satellite data
Hagedorn et al. Cost-based sharing and recycling of (intermediate) results in dataflow programs
Rakhmawati et al. On Metrics for Measuring Fragmentation of Federation over SPARQL Endpoints.
Burdakov et al. Predicting SQL Query Execution Time with a Cost Model for Spark Platform.
Yu et al. Performance studies of a websphere application, trade, in scale-out and scale-up environments
CN112182076A (en) Variable selection method combining different source data
Wang et al. Skew‐aware online aggregation over joins through guided sampling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant