CN116307352A

CN116307352A - Engineering quantity index estimation method and system based on machine learning

Info

Publication number: CN116307352A
Application number: CN202211380237.8A
Authority: CN
Inventors: 刘静; 刘在田
Original assignee: China Nuclear Huawei Engineering Design And Research Co ltd
Current assignee: China Nuclear Huawei Engineering Design And Research Co ltd
Priority date: 2022-11-05
Filing date: 2022-11-05
Publication date: 2023-06-23

Abstract

The invention relates to the technical field of machine learning and engineering cost, in particular to a method and a system for estimating engineering quantity indexes based on machine learning. The method comprises the following steps: acquiring project history data from a project management system and constructing an original data set D₀ from the project history data; performing feature selection on the original data set with a mixed feature selection method to obtain an optimal feature subset S; and building basic regression models based on a plurality of machine learning algorithms, fully integrating the advantages of the multiple models, and building an integrated learning engineering quantity index estimation model. By mixing multiple feature selection methods, the invention improves the prediction effect of the model and solves the problems of large data volatility and the insensitivity of a single feature selection method to certain feature data; by integrating multiple machine learning algorithms, it improves the robustness and accuracy of the model.

Description

Engineering quantity index estimation method and system based on machine learning

Technical Field

The invention relates to the technical field of machine learning and engineering cost, in particular to a method and a system for estimating engineering quantity indexes based on machine learning.

Background

As the real estate industry slows down, competition in the construction market is intensifying and bidding periods are becoming shorter. Engineering quantity index estimation provides an important basis for an enterprise's budget quotation, and its accuracy directly affects the enterprise's investment decisions. Estimating engineering quantity indexes quickly and efficiently is therefore particularly important for raising the technical level and core competitiveness of construction enterprises.

Traditionally, engineering quantity indexes are predicted from human experience and project similarity matching: the engineering quantity index of a new project is estimated by searching for historical projects whose profiles resemble that of the project to be calculated. The prediction process mainly relies on simple statistical analysis and linear regression.

With the development of big data and artificial intelligence, engineering cost prediction has gradually shifted from traditional methods toward information technology, and methods based on the artificial neural network (ANN), the BP neural network (BPNN) and similar models have also been used in China to predict engineering cost.

In view of this state of research, application CN114331221A was previously filed to address the above problems; the present work, however, studies the estimation of engineering quantity indexes rather than of prices, in order to exclude the interference of external factors. Engineering quantity index estimation has several problems: 1) the features affecting engineering quantity index estimation are numerous but lack effective analysis and utilization, and existing research results mostly depend on human experience and lack data support; 2) most existing engineering quantity prediction methods are based on a single simple method or model, whereas a strongly nonlinear relationship exists between the engineering profile and the engineering quantity index, so the error of existing methods is large. The application therefore aims to provide, on the basis of the earlier research, an engineering quantity index estimation method based on machine learning that mixes several feature selection methods to improve the prediction effect of the model and to solve the problems of large data volatility and the insensitivity of a single feature selection method to certain feature data, and that integrates several machine learning algorithms to improve the robustness and accuracy of the model.

Disclosure of Invention

In order to solve the problems, the invention provides a machine learning-based engineering quantity index estimation method and a machine learning-based engineering quantity index estimation system, which are mixed with various feature selections, so that the prediction effect of a model is improved, the problems that the data volatility is large and a single feature selection method is insensitive to certain feature data are solved, various machine learning algorithms are integrated, and the robustness and the accuracy of the model are improved.

In order to achieve the above purpose, the present invention provides the following technical solutions:

The first aspect of the invention: the engineering quantity index estimation method based on machine learning comprises the following steps:

(1) Acquiring project history data from a project management system, and constructing an original data set D₀ from the project history data;

(2) Performing feature selection on the original data set using the mixed feature selection method to obtain an optimal feature subset S, the specific process being as follows:

(201) Constructing a feature selection data set D₁ on the basis of the original data set, D₁ = {(X_ij, y_ij)}, i, j = 1, 2, …, n, where X_ij is the engineering profile difference between monomer i and monomer j, and y_ij represents the relative error of the engineering quantity index between monomer i and monomer j;

(202) Removing linearly correlated feature variables based on the PCA algorithm to obtain a feature subset S₁;

(203) Calculating the maximal information coefficient (MIC) of each pair of feature variables in the original feature set;

(204) Removing redundant features from the original feature variables according to the threshold value to obtain a feature subset S₂. The specific process is as follows:

(2041) Constructing a random forest regression model according to the number of features in the feature subset S₂ and the number of decision trees;

(2042) Evaluating the importance of each single feature using the random forest regression model, wherein the importance of the j-th feature is as follows:

in the formula, e_i is the error value of the i-th decision tree in the random forest regression model evaluated using the out-of-bag data, and e_ji is the error value of the i-th decision tree obtained after noise interference is introduced into the j-th feature;

(2043) Sorting the features by importance and determining a feature screening threshold, wherein the formula of the feature screening threshold is as follows:

δ = min(M) + α, where M represents the set of feature importances in feature subset S₂ and α represents the threshold tolerance;

(205) Computing the importance of each feature in feature subset S₂ using the random forest algorithm;

(206) Further screening the features according to the threshold value to obtain an optimal feature subset S;

(3) Based on a plurality of machine learning algorithms, a basic regression model is built, the advantages of a plurality of models are fully fused, and an integrated learning engineering quantity index estimation model is built;

the process for constructing the integrated learning engineering quantity index estimation model is as follows:

(301) Constructing a machine learning data set based on the optimal feature subset S obtained in the step (2), and dividing the data set into a training set and a testing set;

(302) Building a first-layer machine learning model, wherein the first-layer machine learning model comprises a BPNN model, an RFR model and a PSO-GRNN model;

(303) Training 4 basic learners respectively by adopting 4-fold cross validation, and longitudinally superposing predicted values of the 4 basic learners to obtain new features, and generating a new training set and a new testing set;

(304) Constructing a second-layer machine learning model based on the Ridge regression method, training the second-layer meta-regression model with the new training set, and outputting the final prediction result.
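Steps (301) to (304) describe a two-layer stacking scheme. The following is a minimal sketch of that idea using only the Python standard library. It is not the patented implementation: the two base learners (`fit_linear`, `fit_nearest`) are simple hypothetical stand-ins for BPNN, RFR and PSO-GRNN, the data are synthetic, the Ridge penalty `lam` is an assumed value, and the base learners are refitted on the whole training set for final prediction, which is one common simplification.

```python
import random


def fit_linear(xs, ys):
    """Stand-in base learner 1: ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """Stand-in base learner 2: 1-nearest-neighbour regressor."""
    return lambda x: ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x))]


def solve(a, b):
    """Solve a small linear system a w = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][k] * w[k] for k in range(r + 1, n))) / m[r][r]
    return w


def fit_ridge(z, y, lam=1.0):
    """Second-layer Ridge regression (no intercept): (Z'Z + lam*I) w = Z'y."""
    d = len(z[0])
    a = [[sum(row[p] * row[q] for row in z) + (lam if p == q else 0.0)
          for q in range(d)] for p in range(d)]
    b = [sum(row[p] * yi for row, yi in zip(z, y)) for p in range(d)]
    return solve(a, b)


def stack(xs, ys, fitters, k=4, lam=1.0, seed=0):
    """Train base learners with k-fold CV, then a Ridge meta-learner on their predictions."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    # z[i][m] is the out-of-fold prediction of base learner m for sample i (the new features).
    z = [[0.0] * len(fitters) for _ in xs]
    for val in folds:
        held = set(val)
        tx = [xs[i] for i in idx if i not in held]
        ty = [ys[i] for i in idx if i not in held]
        for m, fit in enumerate(fitters):
            model = fit(tx, ty)
            for i in val:
                z[i][m] = model(xs[i])
    weights = fit_ridge(z, ys, lam)
    models = [fit(xs, ys) for fit in fitters]  # refit on all training data (simplification)
    return lambda x: sum(w * mdl(x) for w, mdl in zip(weights, models))


# Synthetic training data, for illustration only.
xs = [i / 10 for i in range(40)]
ys = [2 * x + 1 for x in xs]
predict = stack(xs, ys, [fit_linear, fit_nearest])
```

Using out-of-fold predictions as the second-layer inputs keeps the meta-regressor from being trained on predictions the base learners made for samples they had already seen.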

The invention is further configured as follows: in step (201),

where ρ represents the engineering quantity index fluctuation threshold.

The second aspect of the invention: the engineering quantity index estimation system based on machine learning comprises an optimal feature subset acquisition unit and an engineering quantity index estimation unit, wherein:

the optimal feature subset obtaining unit is used for interfacing with the project management system and obtaining optimal feature subset data;

the engineering quantity index estimation unit is used for taking the optimal feature subset as input, calculating to obtain the engineering quantity index for use by using the constructed engineering quantity index estimation model, and the input end of the engineering quantity index estimation unit is connected with the output end of the optimal feature subset acquisition unit.

Advantageous effects

Compared with the prior art, the technical solution provided by the invention has the following beneficial effects:

(1) The invention provides a mixed multiple feature selection method based on the particularity of engineering project index data, effectively improves the prediction effect of the model, and solves the problems of large data volatility and insensitivity of a single feature selection method to certain feature data.

(2) The integrated learning engineering quantity index estimation method and system combine a plurality of machine learning algorithms and use a two-layer algorithm model for comprehensive analysis and prediction, which improves the robustness and accuracy of the model. The engineering quantity index prediction error has been verified to be within 5%, so that accurate and effective data support can be provided for early-stage project cost estimation.

Drawings

FIG. 1 is a flow chart of a machine learning-based engineering quantity index estimation method of the present invention;

FIG. 2 is a flow chart of hybrid feature selection in the present invention;

FIG. 3 is a schematic diagram of an integrated learning engineering quantity index estimation model according to the present invention;

FIG. 4 is a schematic diagram of a model of a BPNN-based learner in accordance with the present invention;

FIG. 5 is a flow chart of a PSO-GRNN based learner model in accordance with the present invention;

FIG. 6 is a system diagram of a machine learning based engineering quantity index estimation system according to the present invention;

FIG. 7 is a comparison table of the feature selection methods in the present invention;

FIG. 8 is a table comparing GRNN and PSO-GRNN models in accordance with the present invention;

FIG. 9 is a comparison table of the prediction models in the present invention;

FIG. 10 is a table comparing the prediction results of the integrated learning model according to the present invention;

FIG. 11 is a table of records of relevant factors in the present invention;

FIG. 12 is a table showing engineering quantity index records in the present invention.

The reference numerals in the figures illustrate:

100. optimal feature subset acquisition unit; 200. engineering quantity index estimation unit.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments, and that all other embodiments obtained by persons of ordinary skill in the art without making creative efforts based on the embodiments in the present invention are within the protection scope of the present invention.

In the description of the present invention, it should be noted that the positional or positional relationship indicated by the terms such as "upper", "lower", "inner", "outer", "top/bottom", etc. are based on the positional or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "configured to," "engaged with," "connected to," and the like are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be the communication between the two elements; the specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

Examples:

As shown in FIGS. 1-12, the invention provides a machine learning-based engineering quantity index estimation method, which comprises the following steps:

(1) Acquiring project history data from a project management system, and constructing an original data set D₀ from the project history data.

In this embodiment, the acquired project history data, which include the monomer (individual building) project profiles and the engineering quantity indexes, undergo data cleaning and data preprocessing by existing methods. Data cleaning handles duplicate, missing and abnormal values; data preprocessing converts the feature data types and normalizes the converted data. The original data set D₀ is then constructed.

In addition, to fully mine and utilize the historical project data, this embodiment takes the steel bar engineering quantity index as the prediction object and collects 28 items of monomer profile information, including the region where the project is located and the standard floor height, as the initial feature set. Data preprocessing in this embodiment mainly comprises binarizing and dummy-coding the qualitative features (such as foundation type and project region) and making the data dimensionless by the min-max method.
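The two preprocessing operations named above, dummy coding of qualitative features and min-max scaling, can be sketched as follows. This is a minimal illustration in plain Python; the helper names `one_hot` and `min_max` and the sample values are hypothetical and are not taken from the project data.

```python
def one_hot(values):
    """Dummy-code a qualitative column: one 0/1 indicator per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def min_max(values):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical monomer records (illustrative values only).
foundation_type = ["pile", "raft", "pile", "strip"]
standard_floor_height = [2.9, 3.0, 3.1, 2.9]

encoded, categories = one_hot(foundation_type)
scaled = min_max(standard_floor_height)
```

Min-max scaling puts features with different units on a common [0, 1] range, which matters for distance-based and gradient-based learners such as GRNN and BPNN.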

(201) Constructing a feature selection data set D₁ on the basis of the original data set, D₁ = {(X_ij, y_ij)}, i, j = 1, 2, …, n, where X_ij is the engineering profile difference between monomer i and monomer j, and y_ij represents the relative error of the engineering quantity index between monomer i and monomer j,

where ρ represents the engineering quantity index fluctuation threshold; in this embodiment, ρ = 0.05.

This embodiment takes into account the particular nature of engineering quantity index data: the engineering quantity indexes of monomers with the same engineering profile fluctuate within a certain range, so performing feature selection directly on the original data set yields large errors and poor results. The invention therefore constructs the feature selection data set D₁ on the basis of the original data set.
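A rough sketch of how such a pairwise data set might be assembled is given below. The expression that ties y_ij to ρ is not reproduced in the text, so the sketch makes three labeled assumptions: the relative error is taken as (q_i − q_j) / q_j, pairs with i = j are skipped, and relative errors whose magnitude does not exceed ρ are treated as zero (that is, as ordinary fluctuation). The function name and the sample values are hypothetical.

```python
def build_pairwise_dataset(profiles, indexes, rho=0.05):
    """Build D1 = {(X_ij, y_ij)} from per-monomer profiles and quantity indexes."""
    pairs = []
    n = len(profiles)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: a monomer is not paired with itself
            # X_ij: feature-wise difference between the two engineering profiles.
            x_ij = [a - b for a, b in zip(profiles[i], profiles[j])]
            # y_ij: relative error of the quantity index (assumed form).
            y_ij = (indexes[i] - indexes[j]) / indexes[j]
            if abs(y_ij) <= rho:
                y_ij = 0.0  # assumption: differences within rho count as fluctuation
            pairs.append((x_ij, y_ij))
    return pairs


# Hypothetical, already-normalized profiles and steel quantity indexes.
profiles = [[0.0, 1.0], [0.5, 1.0], [1.0, 0.0]]
indexes = [50.0, 51.0, 60.0]
pairs = build_pairwise_dataset(profiles, indexes)
```

Working on differences between monomers rather than on raw records lets the feature selection focus on which profile changes go together with index changes larger than the normal fluctuation.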

The implementation of the PCA algorithm is not described in detail here. In this embodiment, after PCA dimension reduction, the first 23 feature variables are taken as the feature subset S₁.

In this embodiment, the feature subset S₂ obtained by MIC feature screening contains 18 feature variables.

As an embodiment, one implementation is as follows:

On the basis of data set D₁ and feature subset S₂, a data subset D₂ is constructed as the input of the random forest regression model. The performance of each decision tree in the random forest model is evaluated using the out-of-bag data to obtain the error value of each decision tree, recorded as e_i, i = 1, 2, 3, …, n. Noise disturbance is then added to the j-th variable feature while the remaining features are kept unchanged, and the error value of each decision tree is calculated again and recorded as e_ji. The importance of the j-th feature, i, j = 1, 2, 3, …, n, is:

The feature screening threshold is δ = min(M) + α, where M represents the set of feature importances in feature subset S₂ and α represents the threshold tolerance; in this embodiment, α = 0.01.
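The importance evaluation and threshold screening can be illustrated with a short sketch. The importance expression itself is not reproduced in the text, so the sketch assumes the commonly used form, the mean increase in out-of-bag error over the trees after the feature is perturbed; it also assumes that features with importance of at least δ are kept. The error values and the feature name "facade colour" are made up for illustration.

```python
def feature_importance(e, e_perturbed):
    """Mean increase in out-of-bag error across trees (assumed form of the importance)."""
    return sum(e_j - e_i for e_i, e_j in zip(e, e_perturbed)) / len(e)


def screen_features(importances, alpha=0.01):
    """Apply the threshold delta = min(M) + alpha and keep features at or above it."""
    delta = min(importances.values()) + alpha
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    kept = [name for name, imp in ranked if imp >= delta]  # assumption: keep imp >= delta
    return delta, kept


# Hypothetical per-tree out-of-bag errors e_i and perturbed errors e_ji.
e = [0.10, 0.12, 0.11]
perturbed = {
    "building area": [0.20, 0.22, 0.21],
    "structure type": [0.15, 0.17, 0.16],
    "facade colour": [0.10, 0.125, 0.11],
}
importances = {name: feature_importance(e, errs) for name, errs in perturbed.items()}
delta, kept = screen_features(importances, alpha=0.01)
```

With these numbers the least important feature falls below δ and is dropped, while the other two are retained in descending order of importance.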

In this embodiment, for the steel bar engineering quantity prediction problem, the optimal feature subset is S = {overground/underground, standard floor height, single floor number, earthquake fortification intensity, region where the project is located, fire resistance level, total floor height, structure type, earthquake resistance level, building area, foundation type, civil air defense ratio}.

To further illustrate the advantages of the present invention, the effectiveness of five feature selection methods, PCA, MIC+PCA, MIC+RF, RF+PCA, MIC+RF+PCA, were compared based on the same predictive model.

In this embodiment, it should be noted that the specific comparison method is as follows:

1) Randomly selecting project data of 50 monomers as test data;

2) For five different feature selection methods, feature factors determined by the different feature selection methods are used as input respectively, an engineering quantity prediction model is constructed based on the same prediction method (BPNN algorithm is selected in the test example), training is carried out on the model by using training data, and then engineering quantity of the test data is predicted to obtain engineering quantity prediction values under the different feature selection methods.

3) Three evaluation indexes of the predicted and true values under the different feature selection methods are calculated, namely MSE (mean square error), MAE (mean absolute error) and R2_score (coefficient of determination), and the model performance is comprehensively evaluated and compared; the result is shown in FIG. 7. MSE, MAE and R2_score are standard metrics, and their detailed calculation formulas are not repeated here.
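For reference, the three metrics follow their standard definitions and can be computed as in the sketch below. The function names and sample values are illustrative only and are not the figures reported in FIG. 7.

```python
def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up true and predicted quantity indexes.
y_true = [50.0, 55.0, 60.0, 65.0]
y_pred = [51.0, 54.0, 61.0, 64.0]
scores = (mse(y_true, y_pred), mae(y_true, y_pred), r2_score(y_true, y_pred))
```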

For the results in FIG. 7, smaller MSE and MAE values indicate higher prediction accuracy, and an R2_score closer to 1 indicates a better fit. As the comparison in this embodiment shows, with the prediction model held fixed, the mixed feature selection method based on PCA+MIC+RF yields clearly smaller MSE and MAE values and a higher R2_score than the other four methods, which shows that the mixed feature selection method provided by the invention improves the prediction effect of the prediction model.

(302) Building a first-layer machine learning model, wherein the first-layer machine learning model comprises three parallel basic learners, namely a BPNN model (back propagation neural network), an RFR model (random forest regression model) and a PSO-GRNN model (particle swarm-generalized regression neural network);

In this embodiment, the machine learning data set is D = {(X_i, y_i)}, i = 1, 2, …, n, where X_i ∈ S represents the monomer features of the i-th monomer and y_i represents the engineering quantity index of the i-th monomer. 80% of the data set is used as the training set and 20% as the test set.

The base learners built in step (302) are configured as follows:

For the BPNN base learner in this embodiment, the hidden-layer configuration is optimized by grid search and cross validation, as in the prior art, to obtain the optimal hyperparameters: three hidden layers with 64, 128 and 32 nodes in sequence. The model architecture is shown in FIG. 4. Training uses MSE as the error function and gradient descent with updates scaled by the learning rate, following the steepest-descent direction to find the combination of parameters that minimizes the network error.

For the RFR base learner in this embodiment, the hyperparameters are optimized by grid search and cross validation: the number of base decision trees is 200 and the maximum depth of each decision tree is 50. The RFR base learner is then built on the training data with the optimal parameter combination found by the search.

It should be noted that grid search and cross validation are both mature parameter-tuning techniques and are not described in detail here. In this embodiment, the hyperparameters of the base learners are selected mainly by combining the two methods.

For the different base learners, the general steps are as follows:

1) Presetting several groups of base learner hyperparameter combinations as candidate parameters;

2) Looping over all candidate hyperparameter combinations and evaluating the model performance of each combination by cross validation.

Specifically, for cross validation, the training data are split equally into 4 parts in this embodiment. Each time, one part is used as the validation set and the remaining 3 parts as the training set; the model is trained and tested, and the mean square error on the held-out part is calculated. Training 4 times and testing 4 times yields 4 test errors, which are averaged to obtain the final test error of each hyperparameter combination.

According to the final test errors, the hyperparameter combination with the best performance, namely the smallest error, is selected as the optimal hyperparameter combination.
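The grid search with 4-fold cross validation described above can be sketched in a few lines. To keep the example self-contained, a one-dimensional k-nearest-neighbour regressor stands in for the actual base learners, and its neighbour count is the only hyperparameter; the model, the candidate values and the synthetic data are all assumptions made for illustration.

```python
import random


def k_fold_indices(n, k=4, seed=0):
    """Shuffle sample indices and split them into k equally sized parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def knn_predict(train_x, train_y, x, k):
    """Stand-in model: average of the k nearest training targets."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def cv_error(xs, ys, k_neighbors, folds):
    """Average validation MSE over the folds for one hyperparameter value."""
    errors = []
    for val in folds:
        held = set(val)
        tx = [xs[i] for i in range(len(xs)) if i not in held]
        ty = [ys[i] for i in range(len(xs)) if i not in held]
        fold_mse = sum((knn_predict(tx, ty, xs[i], k_neighbors) - ys[i]) ** 2
                       for i in val) / len(val)
        errors.append(fold_mse)
    return sum(errors) / len(errors)


def grid_search(xs, ys, candidates, n_folds=4):
    """Evaluate every candidate with the same folds and return the one with the smallest error."""
    folds = k_fold_indices(len(xs), n_folds)
    scores = {k: cv_error(xs, ys, k, folds) for k in candidates}
    best = min(scores, key=scores.get)
    return best, scores


# Synthetic data, for illustration only.
xs = [i / 10 for i in range(40)]
ys = [x * x for x in xs]
best_k, scores = grid_search(xs, ys, candidates=[1, 3, 9])
```

Reusing the same folds for every candidate keeps the comparison between hyperparameter combinations fair, since each is scored on identical validation splits.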

For the PSO-GRNN base learner in this embodiment, a three-layer GRNN network structure is first constructed, and the selection of the smoothing factor in the GRNN model is then optimized by the PSO algorithm; the optimization flow is shown in FIG. 5. GRNN and PSO are mature algorithms and are not described in detail here; only the effect of the PSO algorithm on optimizing the GRNN network parameter is reported. To illustrate the effectiveness of the method, this embodiment compares the model accuracy of GRNN and PSO-GRNN; the results are shown in FIG. 8.

As can be seen from the comparison result of FIG. 8, compared with the basic GRNN method, the MSE and MAE index values of the PSO-GRNN method provided by the invention are smaller, and the R2_score is slightly higher, so that the effectiveness of the method for optimizing the GRNN model by using the PSO algorithm provided by the invention is proved.
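To make the PSO-GRNN idea concrete, the sketch below tunes the GRNN smoothing factor with a basic particle swarm on synthetic one-dimensional data. It is an illustration under assumptions, not the flow of FIG. 5: the GRNN is written in its usual kernel-weighted-average form, and the swarm size, iteration count, inertia and acceleration coefficients, search bounds and the sine-curve data are all chosen arbitrarily.

```python
import math
import random


def grnn_predict(train_x, train_y, x, sigma):
    """GRNN output: Gaussian-kernel weighted average of the training targets."""
    weights = [math.exp(-((x - xi) ** 2) / (2 * sigma ** 2)) for xi in train_x]
    total = sum(weights)
    if total == 0.0:  # every kernel underflowed: fall back to the plain mean
        return sum(train_y) / len(train_y)
    return sum(w * yi for w, yi in zip(weights, train_y)) / total


def validation_mse(sigma, train, val):
    """MSE on the validation points for a given smoothing factor."""
    (tx, ty), (vx, vy) = train, val
    return sum((grnn_predict(tx, ty, x, sigma) - y) ** 2 for x, y in zip(vx, vy)) / len(vx)


def pso_sigma(objective, lo=0.01, hi=2.0, particles=10, iters=30, seed=0):
    """Minimize objective(sigma) over [lo, hi] with a basic particle swarm."""
    rng = random.Random(seed)
    pos = [rng.uniform(lo, hi) for _ in range(particles)]
    vel = [0.0] * particles
    best_pos = pos[:]
    best_val = [objective(p) for p in pos]
    g = min(range(particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g], best_val[g]
    for _ in range(iters):
        for i in range(particles):
            # Inertia plus pulls toward the particle's own best and the swarm's best.
            vel[i] = (0.7 * vel[i]
                      + 1.5 * rng.random() * (best_pos[i] - pos[i])
                      + 1.5 * rng.random() * (g_pos - pos[i]))
            pos[i] = min(hi, max(lo, pos[i] + vel[i]))
            val = objective(pos[i])
            if val < best_val[i]:
                best_pos[i], best_val[i] = pos[i], val
                if val < g_val:
                    g_pos, g_val = pos[i], val
    return g_pos, g_val


# Synthetic data: alternate points of a sine curve for training and validation.
train = ([i / 10 for i in range(0, 40, 2)], [math.sin(i / 10) for i in range(0, 40, 2)])
val = ([i / 10 for i in range(1, 40, 2)], [math.sin(i / 10) for i in range(1, 40, 2)])
sigma, best_mse = pso_sigma(lambda s: validation_mse(s, train, val))
```

The smoothing factor is the only trainable quantity of a GRNN, which is why a one-dimensional search is enough here; a smoothing factor that is too large flattens the prediction toward the mean, while one that is too small reduces it to nearest-neighbour lookup.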

Further, in this embodiment, the model accuracy of the integrated learning model of the invention is compared with that of BPNN, SVR, RFR and PSO-GRNN; the results are shown in FIG. 9.

As can be seen from FIG. 9, the integrated learning model of the invention outperforms each single base learner model on all three evaluation indexes, namely MSE, MAE and R2_score.

The model predictions were verified on the test data set; part of the test results are shown in FIG. 10. The experimental results show that the method of the invention predicts the engineering quantity index stably and accurately, with a relative prediction error of less than 5%, which meets the accuracy requirement of early-stage project estimation.

As shown in fig. 6, the present invention further provides an engineering quantity index estimation system based on machine learning, which includes an optimal feature subset obtaining unit 100 and an engineering quantity index estimation unit 200, wherein:

the optimal feature subset obtaining unit 100 is configured to interface with the project management system and obtain optimal feature subset data;

the engineering quantity index estimation unit 200 is configured to take the optimal feature subset as input, calculate to obtain a corresponding engineering quantity index by using the constructed engineering quantity index estimation model, and an input end of the engineering quantity index estimation unit 200 is connected with an output end of the optimal feature subset obtaining unit 100.

The workflow of the method is as follows. A residential building in Changzhou, Jiangsu is analyzed as a specific implementation case. At the initial stage of the project, the project characteristic factors of the residential building (overground/underground, standard floor height, single floor number, earthquake fortification intensity, region where the project is located, fire resistance level, total floor height, structure type, earthquake resistance level, building area, foundation type and civil air defense ratio) are input; the specific relevant factors are shown in FIG. 11. From these input values, the engineering quantity index estimation model produces the predicted engineering quantity index, which is compared with the actual engineering quantity index. The prediction error of the engineering quantity index is within 5% (as shown in FIG. 12), which provides accurate and effective data support for early-stage project cost estimation.

The engineering quantity index estimation system is used for estimating engineering quantity indexes.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. The engineering quantity index estimation method based on machine learning is characterized by comprising the following steps of:

wherein e_i is the error value of the i-th decision tree in the random forest regression model evaluated using the out-of-bag data, and e_ji is the error value of the i-th decision tree obtained after noise interference is introduced into the j-th feature;

(2043) Sorting the features by importance and determining a feature screening threshold, wherein the formula of the feature screening threshold is as follows: δ = min(M) + α, where M represents the set of feature importances in feature subset S₂ and α represents the threshold tolerance;

(206) Further screening the features according to the threshold value to obtain an optimal feature subset S; (3) Based on a plurality of machine learning algorithms, a basic regression model is built, the advantages of a plurality of models are fully fused, and an integrated learning engineering quantity index estimation model is built; the process for constructing the integrated learning engineering quantity index estimation model is as follows:

2. The method for estimating an engineering quantity index based on machine learning according to claim 1, wherein, in step (201),

where ρ represents the engineering quantity index fluctuation threshold.

3. An engineering quantity index estimation system based on machine learning, characterized by comprising an optimal feature subset acquisition unit (100) and an engineering quantity index estimation unit (200), wherein:

the optimal feature subset obtaining unit (100) is used for interfacing with the project management system and obtaining optimal feature subset data; the engineering quantity index estimation unit (200) is used for taking the optimal feature subset as input, calculating to obtain the engineering quantity index for use by using the constructed engineering quantity index estimation model, and the input end of the engineering quantity index estimation unit (200) is connected with the output end of the optimal feature subset acquisition unit (100).