CN104794186B

CN104794186B - Method for collecting training samples for a database load response-time prediction model

Info

Publication number: CN104794186B
Application number: CN201510171679.5A
Authority: CN
Inventors: 牛保宁; 张锦文
Original assignee: Taiyuan University of Technology
Current assignee: Taiyuan University of Technology
Priority date: 2015-04-13
Filing date: 2015-04-13
Publication date: 2017-10-27
Anticipated expiration: 2035-04-13
Also published as: CN104794186A

Abstract

A method for collecting training samples for a database load response-time prediction model, belonging to cluster-based sample collection methods. It comprises: (1) obtaining response data when each database load runs alone; (2) obtaining response data when database loads run in pairs; (3) calculating the change in average page read time; (4) clustering the full sample space according to the change in average page read time; (5) filling the sample selection table; (6) generating the training samples. The invention reduces the number of samples needed by the statistical model while maintaining model accuracy and lowering the cost of building the model.

Description

Method for collecting training samples for a database load response-time prediction model

Technical field

The invention belongs to cluster-based sample collection methods and is a training-sample collection method applied to database load response-time prediction models.

Background art

In current parallel database systems, predicting load response time is very important: it helps the database administrator (DBA) tune database parameters and schedule concurrent loads sensibly.

However, because the interaction mechanism among concurrent database loads is extremely complex, traditional analytical models are complicated to build and predict poorly. The existing literature therefore mainly builds statistical models to predict load response time. A statistical model is built in three steps: sample collection, model training (regression), and model evaluation. The main references in this area are:

[1] Duggan J, Cetintemel U, Papaemmanouil O, et al. Performance Prediction for Concurrent Database Workloads [C] // Proc. of 2011 ACM SIGMOD Conference (SIGMOD'2011). Athens, Greece, 2011: 337-348.

[2] Ahmad M, Aboulnaga A, Babu S, et al. Modeling and Exploiting Query Interaction in Database Systems [C] // Proc. of the 17th Conference on Information and Knowledge Management (CIKM'2008). Napa Valley, US, 2008: 183-192.

[3] Ahmad M, Aboulnaga A, Babu S, et al. QShuffler: Getting the Query Mix Right [C] // Proc. of the 24th International Conference on Data Engineering (ICDE'2008). Cancun, Mexico, 2008: 1415-1417.

[4] Ahmad M, Duan S, Aboulnaga A, et al. Predicting Completion Times of Batch Query Workloads Using Interaction-aware Models and Simulation [C] // Proc. of the 14th International Conference on Extending Database Technology (EDBT'2011). Uppsala, Sweden, 2011: 449-460.

[5] Ahmad M, Duan S, Aboulnaga A, et al. Interaction-aware Scheduling of Report Generation Workloads [J]. The VLDB Journal, 2011, 20(4): 589-615.

[6] Sheikh M B, Minhas U F, Khan O Z, et al. A Bayesian Approach to Online Performance Modeling for Database Appliances Using Gaussian Models [C] // Proc. of the 8th International Conference on Autonomic Computing (ICAC'2011). Karlsruhe, Germany, 2011: 121-130.

However, the sampling methods used with the above statistical models do not take the interaction between loads into account; they obtain samples from the full sample space only by specific sampling or random sampling. As the amount of data in the database grows and load running times increase, training without selecting the training samples makes model training slow, and the cost of building the model becomes very high.

Summary of the invention

To lower the cost of building the model and shorten the time needed to build it, the invention provides a training-sample collection method that reduces model-building cost without significantly reducing model prediction accuracy.

Technical solution: a method for collecting training samples for a database load response-time prediction model, comprising the following:

1. Obtain response data when each database load runs alone;

That is, when each load q runs alone, obtain its response time, CPU time, number of logical reads, and BAL value. BAL is the Buffer Access Latency defined in [1]; it represents the average time the database system takes to complete one physical read, and is referred to in this invention simply as the average read time.

Load q denotes an executable database load generated from load template C_q.

A load template is a parameterized database query or update statement; different query or update statements are regarded as different load templates. Loads generated from the same load template with different parameters are regarded as the same load.

2. Obtain response data when database loads run in pairs; that is, when a first load q_i and a second load q_j run as a pair, obtain the response time, CPU time, number of logical reads, and BAL value of each. The first load q_i and the second load q_j are generated from two different load templates (the first load template C_{q_i} and the second load template C_{q_j}).

3. Calculate the change in average page read time;

The change in average page read time is defined as ΔT_{q_s} = T_{q_s} - T_q, where T_{q_s} is the BAL value of a load q (generated from load template C_q) in sample s, and T_q is the BAL value of load q when it runs alone.
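The definition above amounts to a single subtraction per load and sample. A minimal sketch follows; the function name `delta_bal` and the millisecond values are illustrative assumptions, not measurements from the source.

```python
def delta_bal(bal_in_sample: float, bal_alone: float) -> float:
    """Change in average page read time: dT_{q_s} = T_{q_s} - T_q."""
    return bal_in_sample - bal_alone


# Hypothetical BAL measurements in milliseconds (made-up values).
t_q = 2.0    # T_q: BAL of load q when it runs alone
t_q_s = 3.5  # T_{q_s}: BAL of load q when it runs inside sample s

print(delta_bal(t_q_s, t_q))
```

A positive result means that page reads of load q become slower when it runs together with the other loads of sample s.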

The change in average page read time also satisfies the following formula:

Here ΔT_{q/c_ij} denotes the BAL value of load q when load q runs as a pair with another load c_ij, where c_ij is the load generated from query template C_{c_i} in sample s_j; ΔT_{q/c_i} denotes the BAL value of load q when load q runs as a pair with another load c_i, where c_i is the load generated from query template C_{c_i} in sample s.

The ΔT_{q/c} values obtained from paired runs are used to calculate ΔT_{q_s} of a load q at a higher MPL (Multi Programming Level, the maximum degree of parallelism of the database system, i.e. the number of loads that can run at the same time). ΔT_{q_s} is then calculated by the following formula:

；

4. Cluster the full sample space according to the change in average page read time;

For each load q, at a given MPL (Multi Programming Level), cluster all of its ΔT_{q_s} values. The clustering method is the K-means algorithm, with Euclidean distance as the metric. The number of clusters is MPL*2.
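The clustering step can be sketched as K-means on scalar values, where Euclidean distance reduces to an absolute difference. The sketch below is a plain Lloyd iteration written for illustration; the function name `kmeans_1d`, the evenly spaced initial centers, the cap on the cluster count, and the ΔT values are all assumptions, since the source does not say which K-means implementation or initialization is used.

```python
def kmeans_1d(values, k, max_iter=100):
    """Lloyd's K-means on scalars; returns (labels, centers)."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))  # assumption: never more clusters than distinct values
    # Assumption: start from k evenly spaced distinct values (deterministic).
    centers = [distinct[(i * (len(distinct) - 1)) // max(k - 1, 1)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center by Euclidean distance (|v - c| in 1-D).
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


mpl = 2  # hypothetical MPL, so the cluster count is MPL*2 = 4
deltas = [0.10, 0.12, 0.50, 0.52, 0.90, 0.93, 1.40, 1.42]  # made-up dT_{q_s} values
labels, centers = kmeans_1d(deltas, k=mpl * 2)
print(labels)
```

Each label is the class number of one ΔT_{q_s} value for the load being clustered; the procedure is repeated once per load.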

5. Fill the sample selection table;

6. Generate the training samples.

The invention reduces the number of samples needed by the statistical model while maintaining model accuracy and lowering the cost of building the model.

Embodiment

Embodiment: Suppose five load types q₁, q₂, q₃, q₄, q₅ are given and the MPL is 4, meaning that four loads can run in the database at the same time; the current sample is s₀ (q₁, q₂, q₃, q₄). The loads q₁, q₂, q₃, q₄, q₅ are generated from five query templates C_{q1}, C_{q2}, C_{q3}, C_{q4}, C_{q5} respectively. The database system is IBM DB2, version 9.5.

1. Obtain response data when each load runs alone; the response data comprise the response time, CPU time, number of logical reads, and BAL value T_q;

Run loads q₁, q₂, q₃, q₄, q₅ alone and obtain, for each, the response time, CPU time, number of logical reads, and stand-alone BAL value. The data are obtained with the DB2 snapshot monitor command: "db2 get snapshot for dynamic sql on database".

2. Obtain response data when loads run in pairs; combine q₁, q₂, q₃, q₄, q₅ to obtain all pairwise combinations (10 paired runs) and, for each, the paired-run response time, CPU time, number of logical reads, and BAL value T_{q/c}. The data are again obtained with the DB2 snapshot monitor command.
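The count of 10 paired runs is the number of unordered pairs of five loads. A short sketch of the enumeration, using the load names from the embodiment as plain strings:

```python
from itertools import combinations

loads = ["q1", "q2", "q3", "q4", "q5"]

# Every unordered pair of distinct loads is one paired run.
pairs = list(combinations(loads, 2))

print(len(pairs))
for first, second in pairs:
    print(first, second)
```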

3. Calculate the change in average page read time

The range of ΔT_{q1_s0} is calculated by the following formula:

The current sample is s₀ (q₁, q₂, q₃, q₄) with MPL = 4. The MPL one level below the current one is 3; the samples at that level that can generate s₀ and contain load q₁ are s₁ (q₁, q₂, q₃), s₂ (q₁, q₂, q₄), and s₃ (q₁, q₃, q₄).
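The three lower-level samples can be enumerated mechanically: they are the size-3 subsets of s₀ that still contain q₁. In the sketch below the helper name `lower_level_samples` is an illustrative assumption.

```python
from itertools import combinations


def lower_level_samples(sample, load):
    """Sub-samples one MPL level below `sample` that still contain `load`."""
    size = len(sample) - 1
    return [s for s in combinations(sample, size) if load in s]


s0 = ("q1", "q2", "q3", "q4")
for sub in lower_level_samples(s0, "q1"):
    print(sub)
```

The subset (q₂, q₃, q₄) is dropped because it does not contain q₁, which leaves s₁, s₂ and s₃ as listed above.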

Then:

And：

。

The calculated value of ΔT_{q1_s0} is therefore given by:

This yields the calculated value of ΔT_{q1_s0}, which is the change in average page read time of load q₁ in sample s₀.

The change in average page read time of the other three loads can be obtained in the same way.

For MPL = 4, the samples containing q₁ are s₀ (q₁, q₂, q₃, q₄), s₄ (q₁, q₂, q₄, q₅), s₅ (q₁, q₃, q₄, q₅), and s₆ (q₁, q₂, q₃, q₅).

Calculate ΔT_{q1_s0}, ΔT_{q1_s4}, ΔT_{q1_s5}, and ΔT_{q1_s6} for the respective samples, then apply K-means clustering to these four values.

In a real production environment there are typically more than 20 load types and the MPL is between 30 and 200, so for each load type q at a given MPL there are many samples containing q. K-means clustering is applied to the set of ΔT_{q_s} values, and the number of clusters is generally chosen as MPL*2.

5. Fill the sample selection table

For the sample s selected from each cluster, each load it contains has a number indicating its class.

For example, for s₀ (q₁, q₂, q₃, q₄), one possible classification result is K_{s0} = (3, 1, 7, 4), meaning that in the full sample space ΔT_{q1_s0} belongs to class 3, ΔT_{q2_s0} to class 1, ΔT_{q3_s0} to class 7, and ΔT_{q4_s0} to class 4.

Each sample s has a corresponding classification result K_s.
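One way to picture K_s is as a tuple assembled from the per-load cluster labels of step 4. In the sketch below the dictionary layout and the name `classification_result` are illustrative assumptions; the class numbers are those of the K_{s0} example above.

```python
# Cluster label of each load's dT value, keyed by (load, sample).
cluster_of = {
    ("q1", "s0"): 3,
    ("q2", "s0"): 1,
    ("q3", "s0"): 7,
    ("q4", "s0"): 4,
}


def classification_result(sample_id, loads, cluster_of):
    """K_s: the class number of each load of the sample, in load order."""
    return tuple(cluster_of[(q, sample_id)] for q in loads)


k_s0 = classification_result("s0", ["q1", "q2", "q3", "q4"], cluster_of)
print(k_s0)
```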

Clustering yields the following table:

Based on the above classification results, fill the following sample selection table:

Because the example contains few load types, the sample table has some vacancies. In actual production, some positions conflict, so some positions cannot be filled. In that case the method falls back to random selection and combines samples for the unfilled positions.

6. Generate the training samples

The sample selection table obtained in step 5 constitutes the required model training samples.

The invention provides the following filling algorithm:

Input: load templates C, MPL = M;

Output: selected sample set SampleSeled;

    /* generate the full sample space SampleSpace */
    SampleSpace = GenerateSampleSpace(M, C);

    /* compute ΔT_{q_s} of every load type in every sample */
    For S_j ∈ SampleSpace
        ComputeDIF_BAL(S_j);
    End For

    /* for each load type q_i, cluster all of its ΔT_{qi_Sj} values; the number of clusters is M*2 */
    For i = 1 to C, S_j ∈ SampleSpace
        Kmeans(q_i, ΔT_{qi_Sj}, M*2);
    End For

    /* compute the insertion mutual-exclusion number Mu of each sample; the Mu value of
       sample s is the total number of other samples in SampleSpace that can no longer
       be filled in once s has been inserted first */
    For S_j ∈ SampleSpace
        ComputeMutual(S_j);
    End For

    /* sort the sample space by the Mu value of each sample, in ascending order */
    Sort(Mu_j);

    /* initialize the maximum number of filled samples */
    MaxInsNum = 1;

    /* K is the number of filling rounds */
    For i = 1 to K
        /* insert sample S_j first */
        InsertS(S_j);
        InsertNum = 1;
        /* insert, in order, every other sample that can be inserted */
        For m = j+1 to SampleSpace
            /* check whether S_m can be inserted */
            If (IsInsertS(S_m))
                InsertS(S_m);
                InsertNum++;
        End For
        /* if this round fills more samples than the best scheme so far, save the current filling */
        If (InsertNum > MaxInsNum)
            MaxInsNum = InsertNum;
            RecordInsertS();
    End For

    /* for the remaining unfilled vacancies, combine samples at random */
    RandomInsertS();
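The greedy core of the listing above can be sketched in Python. The source does not spell out when two samples conflict, so this sketch assumes that the selection table has one cell per (load type, class) pair, that each cell holds at most one sample, and that a sample can be inserted only if all of its cells are still free. The names `cells` and `greedy_fill`, the tie-break on the sample name, and every classification result except that of s0 are hypothetical; the final random filling step (RandomInsertS) is omitted.

```python
def cells(sample):
    """Table cells a sample would occupy: one (load, class) pair per load."""
    loads, classes = sample
    return set(zip(loads, classes))


def greedy_fill(samples, rounds):
    """samples: name -> (loads, classes). Returns the largest conflict-free selection found."""
    occupied = {name: cells(s) for name, s in samples.items()}
    # Mu: number of other samples that are blocked once this sample is inserted first.
    mu = {a: sum(1 for b in occupied if b != a and occupied[a] & occupied[b])
          for a in occupied}
    # Ascending Mu; ties broken by name (an assumption, to keep the order deterministic).
    order = sorted(occupied, key=lambda name: (mu[name], name))
    best = []
    for start in range(min(rounds, len(order))):
        chosen, used = [], set()
        for name in order[start:]:
            if not (occupied[name] & used):  # plays the role of IsInsertS
                chosen.append(name)          # plays the role of InsertS
                used |= occupied[name]
        if len(chosen) > len(best):          # plays the role of RecordInsertS
            best = chosen
    return best


samples = {
    "s0": (("q1", "q2", "q3", "q4"), (3, 1, 7, 4)),  # K_s0 from the embodiment
    "s4": (("q1", "q2", "q4", "q5"), (3, 2, 5, 1)),  # hypothetical
    "s5": (("q1", "q3", "q4", "q5"), (1, 2, 6, 2)),  # hypothetical
    "s6": (("q1", "q2", "q3", "q5"), (2, 1, 3, 1)),  # hypothetical
}
print(greedy_fill(samples, rounds=2))
```

Starting each round from a different low-Mu sample and keeping the round that places the most samples mirrors the MaxInsNum bookkeeping of the listing; a different conflict rule would change only the `cells` helper.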

Claims

1. A method for collecting training samples for a database load response-time prediction model, comprising the following steps:

(1) Obtain response data when each database load runs alone;

(2) Obtain response data when database loads run in pairs;

(3) Calculate the change in average page read time;

The change in average page read time is defined as ΔT_{q_s} = T_{q_s} - T_q, where T_{q_s} is the BAL value of load q in sample s and T_q is the BAL value of load q when it runs alone;

and the change in average page read time satisfies the following formula:

Here ΔT_{q/c_ij} denotes the BAL value of load q when load q runs as a pair with another load c_ij, where c_ij is the load generated from query template C_{c_i} in sample s_j; ΔT_{q/c_i} denotes the BAL value of load q when load q runs as a pair with another load c_i, where c_i is the load generated from query template C_{c_i} in sample s;

The ΔT_{q/c} values obtained from paired runs are used to calculate ΔT_{q_s} of a load q at a higher MPL (the maximum degree of parallelism of the database system); ΔT_{q_s} is then calculated by the following formula:

MPL denotes the number of loads that can run at the same time;

BAL denotes the average time the database system takes to complete one physical read;

(4) Cluster the full sample space according to the change in average page read time;

(5) Fill the sample selection table;

(6) Generate the training samples.