CN105512264B - Performance prediction method for concurrent workloads in a distributed database - Google Patents
Performance prediction method for concurrent workloads in a distributed database
- Publication number
- CN105512264B (application CN201510881758.5A)
- Authority
- CN
- China
- Prior art keywords
- query
- regression model
- database
- network transmission
- distributed database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a performance prediction method for concurrent workloads in a distributed database. A linear regression model is established to judge the interactions between queries in the distributed database and to predict the query latency L in the distributed database system under different concurrency levels; the database then performs selective task allocation according to the query latency L. The main steps are: A. selection of the metrics for the query latency L; B. establishment of the linear regression model for the interactions of query combinations under concurrent execution; C. experimental verification of the correctness and validity of the linear regression model. Repeated experiments show that the overall average relative errors of the query latency, the network delay and the number of I/O block reads are 14%, 30% and 37%, respectively. The experimental results show that the proposed linear regression model can predict the performance of concurrent workloads in a distributed database very well, which facilitates subsequent task allocation by the database and shortens the average query latency.
Description
Technical field
The present invention relates to a performance prediction method for workloads in a database, and in particular to a performance prediction method for concurrent workloads in a distributed database.
Background art
At present, there has already been research on performance prediction for database workloads. However, existing studies are limited to single-node databases, that is, databases with only one server, whose performance depends mainly on the utilization of the server's disk and CPU. With the growth of the data volumes produced by research and industry, distributed database systems are used to store and manage petabytes of data while providing high concurrency and scalability. Data in a distributed database are processed through a scatter/gather pattern. For example, a query can be split by one node into multiple subqueries, these subqueries are executed concurrently by many other nodes, and the partial results of each node are then returned to the originating node and combined to obtain the final query result. In a distributed database, the data are therefore stored in a partitioned manner across multiple distributed nodes in a cluster, and the cluster can easily be extended by adding new nodes. This is one of the reasons why distributed databases are used to store and process big data.
In general, a distributed database is used to execute analytical workloads concurrently so as to reduce the required query execution time. However, while concurrent execution brings great advantages, it also raises challenges in terms of resource contention, for example judging the interactions between multiple queries. The interactions between queries may differ: when two queries share a table scan, the effect between them may be positive; conversely, when two queries both require high network transmission bandwidth, the network delay will mutually increase their query execution time.
For a single-node database, task allocation is limited to one server, but for a multi-node distributed database there are many choices for task allocation, and how to allocate tasks so as to shorten the average query latency is something the database should consider when performing task allocation. For example, suppose a distributed database has 3 servers, all executing query tasks; the disk and CPU utilization of server 1 are relatively low, while those of servers 2 and 3 are relatively high. If one more query arrives at this moment, the database must consider which server to assign it to. If other queries are waiting on server 1, or its disk and CPU utilization will increase at the next moment, while the disk and CPU utilization of servers 2 and 3 will decrease at the next moment, then the task should be assigned to server 2 or server 3 rather than to server 1. It is therefore necessary to perform performance prediction for the workload in a distributed database so as to facilitate subsequent task allocation. Owing to the particularities of distributed databases, previous database performance prediction techniques are not applicable to today's distributed databases, and existing performance prediction methods do not perform performance prediction for concurrent workloads.
Summary of the invention
The object of the present invention is to provide a performance prediction method for concurrent workloads in a distributed database. The present invention can perform good performance prediction for concurrent workloads in a distributed database, thereby facilitating subsequent task allocation and shortening the average query latency.
The technical solution of the present invention is a performance prediction method for concurrent workloads in a distributed database, in which a multiple linear regression model is established to judge the interactions between queries in the distributed database and to predict the query latency L in the distributed database system under different concurrency levels; the database then performs selective task allocation according to the query latency L. The main steps are:
A. selection of the metrics for the query latency L;
B. establishment of the multiple linear regression model for the interactions of query combinations under concurrent execution;
C. experimental verification of the correctness and validity of the multiple linear regression model.
In the aforementioned performance prediction method for concurrent workloads in a distributed database, the query latency L in step A comprises the network delay and the local processing.
In the aforementioned performance prediction method for concurrent workloads in a distributed database, the network delay uses the network transmission volume N as its metric, and the local processing uses the number of I/O block reads Y as its metric.
In the aforementioned performance prediction method for concurrent workloads in a distributed database, step B consists of the following parts:
B1: prediction of the query interactions;
B2: prediction of the query latency;
B3: training of the linear regression models based on sampling.
In the aforementioned performance prediction method for concurrent workloads in a distributed database, step B1 comprises: prediction of the number of I/O block reads Y and of the network transmission volume N of a primary query q when executed concurrently with secondary queries p1...pn; the number of I/O block reads Y is predicted by the following linear regression model:
Bq = β1*Bq + β2*Σi Bpi + β3*Σi ΔBq/pi + β4*Σi≠j ΔBpi/pj (1)
The network transmission volume N is predicted by the following linear regression model:
Nq = β1*Nq + β2*Σi Npi + β3*Σi ΔNq/pi + β4*Σi≠j ΔNpi/pj (2)
Step B2 predicts the query latency L by the following linear regression model:
L = Cq + β1*Bq + β2*Nq (3)
Step B3 is as follows: given 2 or more queries, different query combinations are generated using a stratified sampling function, and the different query combinations are run in pairs; the number of I/O block reads Y and the network transmission volume N of each query combination are recorded to form samples, and the samples are used to estimate the coefficients β1, β2, β3 and β4 of the linear regression models by the least-squares method.
In the formulas,
Bq is the number of I/O block reads of the primary query q in isolation;
Σi Bpi is the sum of the numbers of I/O block reads of all secondary queries;
Σi ΔBq/pi is the sum of the direct-effect values of all secondary queries on the primary query, in terms of I/O block reads;
Σi≠j ΔBpi/pj is the sum of the indirect-effect values among the secondary queries, in terms of I/O block reads;
Nq is the network transmission volume of the primary query q in isolation;
Σi Npi is the sum of the network transmission volumes of all secondary queries;
Σi ΔNq/pi is the sum of the direct-effect values of all secondary queries on the primary query, in terms of network transmission volume;
Σi≠j ΔNpi/pj is the sum of the indirect-effect values among the secondary queries, in terms of network transmission volume;
Cq is the CPU overhead time of the query q. An illustrative sketch of formulas (1) to (3) is given below.
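Purely as an illustration and not as part of the claimed method, formulas (1) to (3) can be written as the following minimal Python sketch. The function and parameter names (predict_block_reads, predict_network_volume, predict_latency, beta) are chosen for this example, and the coefficients are assumed to have already been estimated by least squares as described in step B3.

```python
# Minimal sketch of formulas (1)-(3); names and structure are illustrative only.
# b_iso / n_iso: isolated I/O block reads and network volume of the primary query q
# sec_b / sec_n: isolated values of the secondary queries p1..pn
# direct_b / direct_n: direct-effect values (ΔBq/pi, ΔNq/pi) of each secondary query on q
# indirect_b / indirect_n: indirect-effect values (ΔBpi/pj, ΔNpi/pj) among secondary queries

def predict_block_reads(beta, b_iso, sec_b, direct_b, indirect_b):
    """Formula (1): predicted I/O block reads of q under concurrency."""
    b1, b2, b3, b4 = beta
    return b1 * b_iso + b2 * sum(sec_b) + b3 * sum(direct_b) + b4 * sum(indirect_b)

def predict_network_volume(beta, n_iso, sec_n, direct_n, indirect_n):
    """Formula (2): predicted network transmission volume of q under concurrency."""
    b1, b2, b3, b4 = beta
    return b1 * n_iso + b2 * sum(sec_n) + b3 * sum(direct_n) + b4 * sum(indirect_n)

def predict_latency(c_q, beta_l, b_q, n_q):
    """Formula (3): L = Cq + beta1*Bq + beta2*Nq."""
    b1, b2 = beta_l
    return c_q + b1 * b_q + b2 * n_q
```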
In the aforementioned performance prediction method for concurrent workloads in a distributed database, step C is as follows: the queries Q1, Q2, Q3, ..., Qn are run to obtain measured values; the measured values are then fed into the multiple linear regression models, which output predicted values; a part of the predicted values is sampled as the test data set and another part as the training data set, and the fit between the predicted values and the measured values is observed.
In the aforementioned performance prediction method for concurrent workloads in a distributed database, the network transmission volume N uses the number of network packets transmitted between nodes as the raw data when measuring query execution.
In the aforementioned performance prediction method for concurrent workloads in a distributed database, the number of transmitted network packets and the number of I/O block reads Y are obtained using SystemTap.
Beneficial effects of the present invention: compared with the prior art, since data are transmitted between nodes in a distributed database, network overhead is involved when the system executes queries; therefore, when predicting the performance of concurrent query execution, the network delay is taken into account. The present invention proposes linear regression models to predict the interactions when analytical workloads are executed concurrently in a distributed database system. Since the network delay and the local processing are the two most important factors of the query execution time, the present invention uses linear regression models to analyse query execution behaviour from three aspects: network delay, local processing and different degrees of concurrency. In addition, the present invention uses a sampling technique to obtain the query combinations under different concurrency levels. The model of the invention was evaluated on a cluster built with PostgreSQL, and performance prediction was carried out using the typical analytical workload of the TPC-H data set. Repeated experiments show that the overall average relative errors of the query latency, the network delay and the number of I/O block reads Y are 14%, 30% and 37%, respectively. The experimental results show that the proposed linear regression models can predict the performance of concurrent workloads in a distributed database very well, thereby facilitating subsequent task allocation by the database and shortening the average query latency.
Detailed description of the invention
Figure 1 is a schematic diagram of the fit between the predicted and measured values of the query latency of the present invention;
Figure 2 is a schematic diagram of the fit between the predicted and measured values of the number of I/O block reads Y of the present invention;
Figure 3 is a schematic diagram of the fit between the predicted and measured values of the network delay of the present invention;
Figure 4 is a schematic diagram of the average relative errors of the present invention when the concurrency level is 3;
Figure 5 is a schematic diagram of the average relative errors of the present invention when the concurrency level is 4.
Specific embodiment
The present invention is further illustrated below with reference to the accompanying drawings and embodiments, which are not intended to limit the present invention.
Embodiment of the present invention:
1. Performance prediction
The object of the present invention is to study concurrent query latency performance prediction in a distributed database system. The performance of a distributed database system is mainly affected by contention on shared basic resources such as RAM, CPU, disk I/O and network bandwidth. Therefore, the present invention first selects metrics that can be used for query latency performance prediction under concurrent workloads, in particular for a distributed database system.
The present invention focuses on predicting the concurrent query latency of distributed analytical workloads. Analytical queries in a distributed database system mainly involve two aspects: network delay and local processing.
Local processing refers to a query retrieving and processing the data it needs from a node. The local processing time is the average time from when a request to retrieve data is submitted on a node until the required blocks are returned. For logical I/O requests, local processing may require many disk seeks, a series of sequential reads with a small number of writes, or accesses to the cache and buffer pool. In general, the local processing time is mostly spent on I/O operations, with reads far outnumbering writes. Therefore, in the present invention, the average number of I/O block reads Y is used as the metric for query latency performance prediction in the local processing dimension.
Since the data in a distributed database are processed in a scatter/gather manner, network transmission is necessary when executing a query. The data are partitioned and stored on multiple distributed nodes in the cluster, and the transmitted data may be partial query results obtained from a local node, or the final result returned to the node that submitted the request. The network transmission volume N is a factor that affects query latency in a distributed database system, so the present invention uses it as the metric in the network delay dimension.
The present invention is mainly aimed at research on moderately complex analytical queries. For this purpose, the present invention selects 10 moderately complex query statements from TPC-H to form the query set of the invention; these query statements focus on the concurrent execution performance of a distributed database system. First, the 10 query statements are run under different degrees of concurrency on a 10 GB data set generated with TPC-H, on a PostgreSQL cluster composed of 4 nodes, and the measured query latencies are obtained; MPL denotes the number of concurrently executing queries and thus reflects the concurrency level. As shown in Table 1, it can be seen that not every query's latency exhibits a linearly increasing trend as the number of concurrent queries increases.
Table 1. Average query latency of the 10 queries under different concurrency levels
Query | MPL1 | MPL2 | MPL3 | MPL4 |
---|---|---|---|---|
3 | 0.07 | 0.13 | 0.12 | 0.10 |
4 | 5.23 | 5.48 | 5.32 | 5.61 |
5 | 8.92 | 9.62 | 9.70 | 10.46 |
6 | 2.63 | 3.14 | 2.76 | 2.80 |
7 | 27.80 | 29.48 | 31.03 | 32.06 |
8 | 26.95 | 28.24 | 31.85 | 28.12 |
10 | 3.13 | 3.68 | 3.61 | 3.71 |
14 | 3.50 | 4.10 | 3.84 | 4.11 |
18 | 83.14 | 93.47 | 87.93 | 86.03 |
19 | 4.83 | 5.90 | 5.92 | 6.19 |
2. Interaction modelling
As discussed in the previous section, the present invention uses the number of I/O block reads Y and the network transmission volume N as the metrics in the local processing and network delay dimensions, respectively, to predict the performance of different query combinations at different concurrency levels. In this section, the invention therefore proposes two multiple linear regression models to study the interactions of query combinations under concurrent execution. The invention then proposes a further linear regression model that uses the number of I/O block reads Y and the network delay to predict the query latency. Finally, the sampled data set is used for training to obtain the prediction models of the invention.
To predict how queries affect each other under concurrent execution, the present invention first judges the effects on the number of I/O block reads Y and on the network transmission volume N when two queries are executed concurrently, and then gradually increases the concurrency. In particular, the invention constructs multiple linear regression models to analyse the mutual effects at a concurrency level of two. To make the model easier to understand, the invention divides the queries into a primary query and secondary queries: the primary query is the query whose behaviour under concurrency the invention wants to study, and a secondary query is a query executed concurrently with the primary query. Before introducing the proposed models, the relevant variables are first introduced. The values of these variables are obtained from the training data set.
Isolated value: the present invention proposes this variable as a base value, i.e. the value when the primary query is executed without concurrency. The invention uses this value as a baseline for judging the corresponding values under concurrency. For example, for a query i, the invention uses Bi to denote its number of I/O block reads and Ni to denote its network transmission volume.
Concurrent value: likewise, the value of this variable is the sum of the isolated values of the concurrently running queries, i.e. of the Bi or Ni in the example above.
Direct-effect value: the present invention uses this value to represent the effect of a secondary query on the primary query; it is the sum of the changes of the metric. For example, when i is the primary query and j is a secondary query, then for the network transmission volume N, Ni/j denotes the directly affected metric and the change is ΔNi/j = Ni/j - Ni.
Indirect-effect value: the present invention uses this variable to represent how the secondary queries directly affect each other; its value is the sum of the direct-effect values among the secondary queries.
Therefore, the present invention uses the following formulas to predict the average number of I/O block reads Y and the network transmission volume N of a query q when it is executed concurrently with p1...pn:
Bq = β1*Bq + β2*Σi Bpi + β3*Σi ΔBq/pi + β4*Σi≠j ΔBpi/pj (1)
Nq = β1*Nq + β2*Σi Npi + β3*Σi ΔNq/pi + β4*Σi≠j ΔNpi/pj (2)
The present invention estimates the coefficients β1, β2, β3 and β4 of each query using the least-squares method; these coefficients are obtained by training on the training data set.
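The coefficient estimation described above can be sketched with NumPy's ordinary least-squares solver as follows. This is only an illustrative sketch: it assumes that the isolated values, direct-effect sums and indirect-effect sums of each training combination have already been collected into rows, and the array names and numbers are invented for the example rather than taken from the patent.

```python
import numpy as np

# Each training sample corresponds to one query combination.
# Columns: [isolated value of q, sum of secondary isolated values,
#           sum of direct-effect values on q, sum of indirect-effect values]
X = np.array([
    [120.0, 310.0, 45.0, 12.0],   # illustrative measurements only
    [130.0, 290.0, 50.0, 15.0],
    [110.0, 350.0, 40.0, 10.0],
    [125.0, 305.0, 47.0, 13.0],
    [118.0, 320.0, 44.0, 11.0],
    [122.0, 300.0, 46.0, 14.0],
])
# Measured value of the same metric (e.g. I/O block reads of q) under concurrency.
y = np.array([410.0, 420.0, 400.0, 415.0, 408.0, 412.0])

# Ordinary least squares: solve X @ beta ≈ y for beta = (β1, β2, β3, β4).
beta, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("estimated coefficients:", beta)
```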
The present invention considers both the number of I/O block reads Y and the network transmission volume N to establish a linear regression model that predicts the latency of each query. For a distributed database system, the query latency is composed mainly of the network delay and the local processing, and the local processing time mainly comprises the specific CPU overhead time and the average logical I/O waiting time; the latency of a query q can therefore be predicted by the following formula:
L = Cq + β1*Bq + β2*Nq (3)
In formulas (1), (2) and (3) above, Cq denotes the specific CPU overhead time of the query q, Bq denotes the average number of I/O block reads, and Nq denotes the average network transmission volume between the nodes of the distributed database.
The coefficients β1 and β2 of formula (3) are obtained by applying the least-squares method to the samples collected in repeated experiments.
To make the proposed model easier to understand, a simple example is introduced next. Suppose the present invention wants to predict the latency of a query a when it is executed concurrently with queries b and c in the distributed database system. The invention first needs to compute the following values:
the isolated I/O block read counts Ba, Bb and Bc and the isolated network transmission volumes Na, Nb and Nc of the queries a, b and c;
the direct-effect values on query a under concurrent execution with b and c: ΔBa/b, ΔNa/b, ΔBa/c, ΔNa/c;
the indirect-effect values: ΔBc/b, ΔNc/b, ΔBb/c, ΔNb/c.
The present invention can then obtain the corresponding metrics by the following two formulas:
Ba = β1*Ba + β2*(Bb+Bc) + β3*(ΔBa/b+ΔBa/c) + β4*(ΔBc/b+ΔBb/c)
Na = β1*Na + β2*(Nb+Nc) + β3*(ΔNa/b+ΔNa/c) + β4*(ΔNc/b+ΔNb/c)
Formula (3) is then used to predict the latency of query a:
La = Ca + β1*Ba + β2*Na
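Purely as a worked illustration of this example, the three formulas can be evaluated with hypothetical numbers as in the sketch below; every coefficient and measurement is invented for the illustration and none comes from the patent's experiments.

```python
# All numbers below are invented for illustration; they are not from the patent's experiments.
b1, b2, b3, b4 = 0.9, 0.1, 0.8, 0.2          # assumed coefficients for formula (1)
n1, n2, n3, n4 = 0.95, 0.05, 0.7, 0.1        # assumed coefficients for formula (2)
Ca, l1, l2 = 1.5, 0.002, 0.001               # assumed Cq and coefficients of formula (3)

# Isolated values and effect values of queries a (primary), b and c (secondary).
Ba, Bb, Bc = 1000.0, 800.0, 600.0
dBa_b, dBa_c, dBc_b, dBb_c = 150.0, 90.0, 30.0, 40.0
Na, Nb, Nc = 5000.0, 4200.0, 3100.0
dNa_b, dNa_c, dNc_b, dNb_c = 400.0, 250.0, 80.0, 90.0

# Formulas (1) and (2): predicted I/O block reads and network volume of a under concurrency.
Ba_conc = b1 * Ba + b2 * (Bb + Bc) + b3 * (dBa_b + dBa_c) + b4 * (dBc_b + dBb_c)
Na_conc = n1 * Na + n2 * (Nb + Nc) + n3 * (dNa_b + dNa_c) + n4 * (dNc_b + dNb_c)

# Formula (3): La = Ca + beta1*Ba + beta2*Na.
La = Ca + l1 * Ba_conc + l2 * Na_conc
print(f"predicted latency of query a: {La:.2f}")
```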
To obtain the query latency from formula (3) above, the prediction model of the invention needs to be trained. First, the characteristics of the 10 queries when run in isolation are obtained, such as the query execution latency, the number of I/O block reads Y and the network delay; these 10 queries are the baseline query statements from which the various query combinations at different MPLs are built. When the invention runs queries in pairs, for example 55 query pairs, it can obtain the specific characteristics of how the queries in each pair affect each other.
To run these query statements while producing a high degree of interaction across multiple machines, the present invention uses LHS to generate the different query combinations that represent the required workload. LHS is a stratified sampling function that makes it very convenient to generate sample data. Table 2 shows an LHS example at MPL 2; in this example, LHS generates 5 query pairs. In the experiments, the number of I/O block reads and the network transmission volume N of each query combination are recorded to form samples, and these samples are used to estimate the coefficients of the models. For each query, many query combination instances are generated to form samples. For example, "query 3" denotes the combination Q3, Q4, Q5, in which Q3 is the primary query and Q4 and Q5 are the secondary queries. A minimal LHS sketch is given after Table 2.
Table 2. Two-dimensional LHS example
Query | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
1 | X | ||||
2 | X | ||||
3 | X | ||||
4 | X | ||||
5 | X |
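The following is a minimal sketch of Latin hypercube sampling used to pick query pairs. It is illustrative only: it implements a basic LHS by hand with NumPy rather than the exact sampling function used by the inventors, and the query list and seed are assumptions for the example.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng=None):
    """Basic LHS: one point per stratum in every dimension, strata shuffled per dimension."""
    rng = np.random.default_rng(rng)
    # Divide [0, 1) into n_samples strata and draw one point per stratum.
    u = (rng.random((n_samples, n_dims)) + np.arange(n_samples)[:, None]) / n_samples
    # Shuffle the strata independently in each dimension.
    for d in range(n_dims):
        rng.shuffle(u[:, d])
    return u

# Example: pick 5 pairs (MPL 2) out of the 10 candidate TPC-H queries.
queries = ["Q3", "Q4", "Q5", "Q6", "Q7", "Q8", "Q10", "Q14", "Q18", "Q19"]
samples = latin_hypercube(5, 2, rng=42)
pairs = [(queries[int(x * len(queries))], queries[int(y * len(queries))]) for x, y in samples]
print(pairs)  # five (primary, secondary) query pairs; the exact pairs depend on the seed
```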
For the established linear regression models, the correctness and validity of the three proposed models need to be assessed. Through experiments, the measured and predicted values of the query latency, the network transmission volume N and the number of I/O block reads Y are examined, and the average relative error of each query is examined for concurrency levels 3 and 4.
3. Experimental verification
To assess the feasibility of the method and the accuracy of the models, queries are executed on the 10 GB data set generated with the QGEN tool provided by TPC-H. Since this research focuses on analytical workloads, the present invention selects Q3, Q4, Q5, Q6, Q7, Q8, Q10, Q14, Q18 and Q19 from the 22 TPC-H queries to form the query combinations of the invention. These queries are chosen because their execution times are relatively long, which provides more time and is favourable for collecting the number of I/O block reads Y and the network transmission volume N. The distributed database system of this experiment is a database cluster composed of four PostgreSQL nodes; the experiment uses Postgres-XL to realise this, Postgres-XL being an open-source PostgreSQL database cluster that offers a high level of scalability and flexibility for different database workloads. The cluster is deployed on physical machines with 4-core 2 GHz processors (Intel(R) Xeon(R) CPU E5-2620) and 8 GB of memory, and the operating system running on each node is CentOS 6.4 with Linux kernel 2.6.32.
First, the training data set is obtained by the sampling technique, and the multiple linear regression models are obtained using Matlab; the test data set is then used to predict the number of I/O block reads Y and the network delay of a query under concurrent execution.
The training data set and the test data set are obtained as follows: the queries Q1, Q2, Q3, ..., Qn are run to obtain measured values; the measured values are then fed into the multiple linear regression models, which output predicted values; a part of the predicted values is sampled as the test data set and another part as the training data set, and the fit between the predicted values and the measured values is observed.
The fit between the predicted values and the measured values is presented in Figure 1. In the experiments, the present invention uses the coefficient of determination R2 to measure how well a regression model fits. The value of R2 ranges from 0 to 1; the closer it is to 1, the closer the predicted values are to the measured values and the better the regression model of the invention. Figures 1, 2 and 3 respectively show the fit between the predicted and measured values of the query latency, the network delay and the number of I/O block reads Y obtained with the proposed prediction models under concurrent execution. The corresponding R2 values are 0.94, 0.58 and 0.84, which illustrates the ability of this research to predict query latency using the network delay and the number of I/O block reads Y. For each query, the models in formulas (1) and (2) are first used to predict the network delay and the number of I/O block reads Y, and formula (3) is then used to predict the query latency. In the experiments, the present invention uses the number of network packets transmitted between nodes as the raw data for measuring the network transmission volume N during query execution.
It is worth noting the method of obtaining the raw data, because these raw data are later processed into samples, and the two key factors that influence the quality of a linear regression model are the quality and the quantity of the samples; the method of obtaining the raw data is therefore very important.
To collect the number of I/O block reads Y and the network transmission volume N during the execution of each query, this research uses SystemTap to execute hand-written scripts that collect the data dynamically. SystemTap is a dynamic method for monitoring and tracing a running Linux kernel, and it provides a simple command-line interface and scripting language for the user. Compared with capturing PostgreSQL's own statistics or using other tools to obtain the network delay, SystemTap yields more accurate counts of transmitted network packets and of I/O block reads Y. In addition, to obtain more time for data collection and to make the collected data more accurate, the present invention adjusted the shared_buffers value of PostgreSQL appropriately.
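The patent collects these counters with SystemTap scripts, which are not reproduced here. Purely as an illustration of the before/after counter-sampling idea, the sketch below reads Linux's /proc/diskstats and /proc/net/dev around a query instead; this is a substitute technique, coarser than SystemTap, and the device and interface names are assumptions for the example.

```python
def read_disk_reads(device="sda"):
    """Reads-completed counter for one block device from /proc/diskstats (4th field)."""
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                return int(fields[3])
    raise ValueError(f"device {device} not found")

def read_net_packets(interface="eth0"):
    """Received plus transmitted packet counters for one interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(interface + ":"):
                fields = line.split(":")[1].split()
                return int(fields[1]) + int(fields[9])  # rx packets + tx packets
    raise ValueError(f"interface {interface} not found")

# Usage: sample the counters before and after running a query and take the difference.
# before_io, before_net = read_disk_reads(), read_net_packets()
# ... execute the query against the cluster node ...
# io_delta = read_disk_reads() - before_io
# net_delta = read_net_packets() - before_net
```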
As mentioned above, the present invention obtains the coefficients of the models using ordinary least squares (OLS). According to OLS, empirically at least 6 samples are needed to predict the query latency and satisfy the basic requirements; in the experiments, the invention uses 120 sample values. To predict the network delay and the number of I/O block reads Y, at least 13 samples are needed, and 140 samples are used in this research. It was also found in the experiments that increasing the sample size does not change the overall trend; it merely makes the points somewhat denser. For this reason, Figure 1 does not plot more points, and points with particularly large values are not shown; for example, the execution time of query 18 is not shown in Figure 1. In addition, for the prediction of the network delay in Figure 3, some predictions can be seen to be rather high or rather low; this is because fluctuations of the experimental network, or packet loss while collecting the data, make the error between the predicted and observed values larger.
In addition, in order to stay closer to practical application scenarios, the cache was not cleared before each query execution in the experiments; this is one of the reasons why the prediction accuracy decreases slightly as the concurrency level increases. This phenomenon can be observed by comparing the average relative errors of the number of I/O block reads Y in Figures 4 and 5.
Comparing Figures 4 and 5 also shows that the average relative error of query 3 (Q3) at a concurrency level of 3 is higher than at a concurrency level of 4. Analysis shows that this is because the execution time of query 3 (Q3) is too short to obtain sufficiently accurate source data, and the low sample quality makes the prediction error higher.
Figures 4 and 5 respectively show the average relative errors of the query latency, the network delay and the number of I/O block reads Y when the concurrency level is 3 and 4, where the average relative error is computed as |(measured value - predicted value) / measured value|. The overall average relative errors of the query latency, the network delay and the number of I/O block reads Y are 14%, 30% and 37%, respectively. The experimental results show that the models proposed by the present invention can predict the performance of concurrent workloads in a distributed database system very well.
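For reference, the two evaluation quantities used above, the coefficient of determination R2 and the average relative error, can be computed with a short sketch like the following; the array names and values are illustrative and are not experimental data from the patent.

```python
import numpy as np

def r_squared(measured, predicted):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    measured, predicted = np.asarray(measured), np.asarray(predicted)
    ss_res = np.sum((measured - predicted) ** 2)
    ss_tot = np.sum((measured - measured.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_relative_error(measured, predicted):
    """Average of |(measured - predicted) / measured| over all samples."""
    measured, predicted = np.asarray(measured), np.asarray(predicted)
    return np.mean(np.abs((measured - predicted) / measured))

measured = [5.2, 8.9, 27.8, 83.1]      # illustrative latencies only
predicted = [5.6, 9.4, 26.5, 88.0]
print(r_squared(measured, predicted), mean_relative_error(measured, predicted))
```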
Claims (4)
1. A performance prediction method for concurrent workloads in a distributed database, characterised in that: a multiple linear regression model is established to judge the interactions between queries in the distributed database and to predict the query latency L in the distributed database under different concurrency levels, and the database performs selective task allocation according to the query latency L; the main steps comprise:
A. selection of the metrics for the query latency L;
B. establishment of the multiple linear regression model for the interactions of query combinations under concurrent execution;
C. experimental verification of the correctness and validity of the multiple linear regression model;
the query latency L in step A comprises the network delay and the local processing;
the network delay uses the network transmission volume N as its metric; the local processing uses the number of I/O block reads Y as its metric;
step B consists of the following parts:
B1: prediction of the query interactions;
B2: prediction of the query latency;
B3: training of the linear regression models based on sampling;
step B1 comprises: prediction of the number of I/O block reads Y and of the network transmission volume N of a primary query q when executed concurrently with secondary queries p1...pn; the number of I/O block reads Y is predicted by the following linear regression model:
Bq = β1*Bq + β2*Σi Bpi + β3*Σi ΔBq/pi + β4*Σi≠j ΔBpi/pj (1)
the network transmission volume N is predicted by the following linear regression model:
Nq = β1*Nq + β2*Σi Npi + β3*Σi ΔNq/pi + β4*Σi≠j ΔNpi/pj (2)
step B2 predicts the query latency L by the following linear regression model:
L = Cq + β1*Bq + β2*Nq (3)
step B3 is as follows: given 2 or more queries, different query combinations are generated using a stratified sampling function, and the different query combinations are run in pairs; the number of I/O block reads Y and the network transmission volume N of each query combination are recorded to form samples, and the samples are used to estimate the coefficients β1, β2, β3 and β4 of the linear regression models by the least-squares method;
in the formulas,
Bq is the number of I/O block reads of the primary query q in isolation;
Σi Bpi is the sum of the numbers of I/O block reads of all secondary queries;
Σi ΔBq/pi is the sum of the direct-effect values of all secondary queries on the primary query, in terms of I/O block reads;
Σi≠j ΔBpi/pj is the sum of the indirect-effect values among the secondary queries, in terms of I/O block reads;
Nq is the network transmission volume of the primary query q in isolation;
Σi Npi is the sum of the network transmission volumes of all secondary queries;
Σi ΔNq/pi is the sum of the direct-effect values of all secondary queries on the primary query, in terms of network transmission volume;
Σi≠j ΔNpi/pj is the sum of the indirect-effect values among the secondary queries, in terms of network transmission volume;
Cq is the CPU overhead time of the query q.
2. The performance prediction method for concurrent workloads in a distributed database according to claim 1, characterised in that step C is as follows: the queries Q1, Q2, Q3, ..., Qn are run to obtain measured values; the measured values are then fed into the multiple linear regression model to obtain predicted values; a part of the predicted values is sampled as the test data set and another part as the training data set, and the fit between the predicted values and the measured values is observed.
3. The performance prediction method for concurrent workloads in a distributed database according to claim 1, characterised in that the network transmission volume N uses the number of network packets transmitted between nodes as the raw data when measuring query execution.
4. The performance prediction method for concurrent workloads in a distributed database according to claim 3, characterised in that the number of transmitted network packets and the number of I/O block reads Y are obtained using SystemTap.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510881758.5A CN105512264B (en) | 2015-12-04 | 2015-12-04 | Performance prediction method for concurrent workloads in a distributed database |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105512264A CN105512264A (en) | 2016-04-20 |
CN105512264B true CN105512264B (en) | 2019-04-19 |
Family
ID=55720246
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510881758.5A Active CN105512264B (en) | 2015-12-04 | 2015-12-04 | Performance prediction method for concurrent workloads in a distributed database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105512264B (en) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107451041B (en) * | 2017-07-24 | 2019-11-22 | 华中科技大学 | A kind of object cloud storage system response delay prediction technique |
CN107679243A (en) * | 2017-10-31 | 2018-02-09 | 麦格创科技(深圳)有限公司 | Task distributes the application process and system in distributed system |
CN109308193B (en) * | 2018-09-06 | 2019-08-09 | 广州市品高软件股份有限公司 | A kind of multi-tenant function calculates the concurrency control method of service |
CN110210000A (en) * | 2019-04-18 | 2019-09-06 | 贵州大学 | The identification of industrial process efficiency and diagnostic method based on Multiple Non Linear Regression |
CN111782396B (en) * | 2020-07-01 | 2022-12-23 | 浪潮云信息技术股份公司 | Concurrency elastic control method based on distributed database |
CN112307042B (en) * | 2020-11-01 | 2024-09-17 | 宋清卿 | Database load analysis method for query intensive data storage processing system |
US11568320B2 (en) * | 2021-01-21 | 2023-01-31 | Snowflake Inc. | Handling system-characteristics drift in machine learning applications |
CN113157814B (en) * | 2021-01-29 | 2023-07-18 | 东北大学 | Query-driven intelligent workload analysis method under relational database |
CN113485638B (en) * | 2021-06-07 | 2022-11-11 | 贵州大学 | Access optimization system for massive astronomical data |
CN113296964B (en) * | 2021-07-28 | 2022-01-04 | 阿里云计算有限公司 | Data processing method and device |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609416B (en) * | 2009-07-13 | 2012-11-14 | 清华大学 | Method for improving performance tuning speed of distributed system |
CN101841565B (en) * | 2010-04-20 | 2013-07-31 | 中国科学院软件研究所 | Database cluster system load balancing method and database cluster system |
CN104794186B (en) * | 2015-04-13 | 2017-10-27 | 太原理工大学 | The acquisition method of database loads response time forecast model training sample |
Also Published As
Publication number | Publication date |
---|---|
CN105512264A (en) | 2016-04-20 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |