CN117763356A - Rapid earthquake phase identification method based on LightGBM algorithm - Google Patents
Rapid earthquake phase identification method based on LightGBM algorithm
- Publication number
- CN117763356A CN117763356A CN202311804932.7A CN202311804932A CN117763356A CN 117763356 A CN117763356 A CN 117763356A CN 202311804932 A CN202311804932 A CN 202311804932A CN 117763356 A CN117763356 A CN 117763356A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of seismic facies identification, and in particular to a rapid seismic facies identification method based on the LightGBM algorithm, comprising the following steps: S1, creating seismic facies classification labels for a seismic imaging data training set, then preprocessing and expanding the data set to complete the preparation of the data set and labels; S2, splitting the seismic imaging data and label data set into a training set and a prediction set at a ratio of 6:4, where the training set is used to train the LightGBM model and the prediction set is used to evaluate the performance of the model.
Description
Technical Field
The invention relates to the technical field of seismic facies identification, in particular to a rapid seismic facies identification method based on a LightGBM algorithm.
Background
In the exploration and development of subsurface resources such as oil and gas, coal, and brine minerals, the study of the depositional environment is of great significance.
Conventionally, sedimentary facies are studied using cores or outcrops. However, in the vast areas without outcrops, the sedimentary facies of a target interval can only be observed from core data. Because the coring rate of a well is low, generally a few percent to a few tens of percent, and drilling coring is not continuous, even sufficient and accurate single-well facies analysis can hardly reflect the spatial distribution characteristics of sedimentary facies across a region. Characterizing the sedimentary facies distribution of a work area would require a sufficiently dense well pattern, which is very difficult to achieve in actual production. A new technique is therefore needed that can better obtain the regional distribution characteristics of sedimentary facies from only a small amount of drilling core data, and seismic facies analysis technology emerged to solve this problem.
As exploration deepens, the energy industry faces new situations and challenges, which require seismic facies analysis technology to develop toward higher efficiency and finer resolution. The qualitative or semi-quantitative analysis of manually guided seismic facies in the prior art suffers from the subjectivity and unreliability of manual experience, so the reliability of the data analysis is insufficient to meet current complex production and development requirements.
Disclosure of Invention
The invention aims to provide a rapid seismic facies identification method based on the LightGBM algorithm to solve the problems described in the background art.
The aim of the invention can be achieved by the following technical scheme:
A rapid seismic facies identification method based on the LightGBM algorithm comprises the following steps:
S1, creating seismic facies classification labels for a seismic imaging data training set, and then preprocessing and expanding the data set to complete the preparation of the data set and labels;
S2, splitting the seismic imaging data and label data set into a training set and a prediction set at a ratio of 6:4, wherein the training set is used to train the LightGBM model and the prediction set is used to evaluate the performance of the model;
S3, tuning the hyperparameters of the LightGBM model by cross-validation, fitting a prediction model on the training set with the optimal hyperparameter combination, and comprehensively evaluating the performance of the model with the prediction set data to obtain the optimal classification model for the current task;
S4, inputting the data set for verification into the prediction model to obtain the automatic seismic facies classification result, completing the automatic classification of seismic facies.
Preferably, in step S1, expanding the data set comprises the following steps:
S11, classifying the data to be identified by manual interpretation in combination with existing seismic interpretation methods;
S12, selecting the maximum probability value and comparing it with a preset probability threshold λ; if it is not smaller than λ, outputting the seismic facies class corresponding to the maximum probability value as the recognition result, otherwise performing data expansion;
S13, labeling the data to be identified with the classification result and then adding it to the sample data set;
S14, maintaining the original model with the new sample data so as to realize the identification of new seismic facies types.
Preferably, in step S1, when the seismic facies classification labels are created, the seismic facies are classified into 9 classes, and noise data with low correlation are removed by feature selection so as to improve model training efficiency and generalization capability.
Preferably, the LightGBM algorithm is a gradient boosting decision tree model, the feature selection adopts an embedded feature selection approach, feature importance is calculated from the contribution rate of features in the tree model structure, and features with relatively high importance are preferentially selected.
Preferably, in the LightGBM feature importance calculation, the global contribution rate of feature j is measured by the average of its contribution rates in the individual trees:
J_j^2 = (1/M) · Σ_{m=1}^{M} J_j^2(T_m)
where T_m denotes the m-th decision tree and M is the number of decision trees. The contribution rate of feature j in a single tree is given by formula 3-4:
J_j^2(T) = Σ_{t=1}^{L-1} i_t^2 · 1(v_t = j)
where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes, v_t is the feature used to split node t, and i_t^2 is the reduction in squared loss after node t splits.
The invention has the beneficial effects that:
the invention can develop the earthquake phase automatic quantitative analysis technology by means of strong computer computing capability, is beneficial to improving the working efficiency of the interpretation of petroleum geophysical prospecting data, shortening the working period and reducing the working cost, is beneficial to reducing subjectivity and unreliability of manual experience, enhancing the reliability of data analysis, and improving the capability and application effect of the petroleum industry for solving the complex exploration and development problem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort;
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a sample of the input training data and label data in accordance with the present invention;
FIG. 3 is the seismic facies interpretation label data corresponding to the sample data of FIG. 2 in accordance with the present invention;
FIG. 4 is a graph of training loss function change of the LightGBM model according to the invention;
FIG. 5 is a plot of experimental seismic imaging profile data for use in the present invention;
FIG. 6 is a graph of the seismic facies classification results automatically predicted using the LightGBM method of the present invention;
FIG. 7 is a schematic cross-validation diagram in accordance with the present invention;
FIG. 8 is a schematic diagram of a decision tree model in the present invention;
FIG. 9 is a schematic diagram of ensemble learning in accordance with the present invention;
FIG. 10 is a schematic illustration of a GBDT according to the present invention;
FIG. 11 is a schematic diagram of a histogram algorithm in the present invention;
FIG. 12 is a schematic diagram of a layer-by-layer growth strategy in the present invention;
FIG. 13 is a schematic of a leaf-by-leaf growth strategy in the present invention;
FIG. 14 is a schematic diagram of the feature selection technique in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to realize automatic seismic facies interpretation of seismic imaging data based on the lightweight gradient boosting algorithm (LightGBM). To achieve the above object, the invention adopts the following steps:
On the training set of the seismic imaging data (shown in FIG. 2), seismic facies classification labels (shown in FIG. 3) are created based on expert experience and knowledge, and the data set is then preprocessed and expanded to prepare the data set and labels.
The seismic imaging data and label data set are split into a training set and a test set at a ratio of 6:4; the training set is used for the training process of the LightGBM model, and the test set is used to evaluate the performance of the model.
The LightGBM model hyperparameters are tuned by cross-validation (CV). A prediction model is fitted on the training set using the optimal hyperparameter combination, and the performance of the model is comprehensively evaluated on the test set data to obtain the optimal classification model for the current task.
The data set for verification is input into the prediction model to obtain the automatic seismic facies classification result, completing the automatic classification of seismic facies based on machine learning.
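The following is a minimal, hedged sketch of this workflow using the open-source lightgbm and scikit-learn Python packages. The feature matrix X and the integer facies labels y are random placeholders standing in for the prepared seismic imaging data and label set, and the hyperparameter values are illustrative rather than the tuned values reported later in Table 2.

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

# Placeholder data: X holds per-sample seismic imaging features,
# y holds integer facies labels in {0, ..., 8} (9 classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = rng.integers(0, 9, size=5000)

# Split into a 6:4 training / evaluation set.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)

# Fit a multi-class LightGBM model (illustrative hyperparameters;
# the sklearn wrapper infers the 9 classes from the labels).
model = lgb.LGBMClassifier(
    objective="multiclass",
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,
    max_depth=8)
model.fit(X_train, y_train,
          eval_set=[(X_eval, y_eval)],
          eval_metric="multi_logloss")

# Predict facies for the held-out data and report evaluation metrics.
proba = model.predict_proba(X_eval)
pred = model.predict(X_eval)
print("accuracy:", accuracy_score(y_eval, pred))
print("multi_logloss:", log_loss(y_eval, proba))
```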
Unlike conventional one-off data set construction, the invention builds and maintains the data set gradually in a dynamic expansion mode. The main purpose of data expansion is to continuously improve the classification capability of the model and to reduce the lasting influence of insufficient early sample data on subsequent training.
First, the data to be identified are classified by expert interpretation in combination with traditional seismic interpretation methods;
Second, the maximum probability value is selected and compared with a preset probability threshold λ; if it is not smaller than λ, the seismic facies class corresponding to the maximum probability value is output as the recognition result, otherwise data expansion is performed;
Third, the data to be identified are labeled with the classification result and then added to the sample data set;
Fourth, the original model is maintained with the new sample data, so that the model gains the ability to identify new seismic facies types.
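A minimal sketch of this expansion loop is given below. The threshold value, the already-trained `model` (an LGBMClassifier as in the previous sketch), the unlabeled candidate batch `X_new`, and the placeholder `ask_expert` labeling function are assumptions introduced only for illustration.

```python
import numpy as np

LAMBDA = 0.8  # preset probability threshold (illustrative value)

def expand_dataset(model, X_new, X_samples, y_samples, ask_expert):
    """Dynamic expansion steps for a batch of unlabeled data X_new.

    ask_expert is a placeholder callable that returns an expert-assigned
    facies label for a single sample (the manual-interpretation step).
    """
    proba = model.predict_proba(X_new)       # class probabilities per sample
    max_p = proba.max(axis=1)                # maximum probability value
    auto_label = proba.argmax(axis=1)        # facies class of that maximum

    results = []
    for i in range(len(X_new)):
        if max_p[i] >= LAMBDA:
            # Confident prediction: output the facies class directly.
            results.append(int(auto_label[i]))
        else:
            # Low confidence: fall back to expert interpretation and expand.
            label = int(ask_expert(X_new[i]))
            results.append(label)
            X_samples = np.vstack([X_samples, X_new[i:i + 1]])
            y_samples = np.append(y_samples, label)

    # Maintain the original model with the enlarged sample data set.
    model.fit(X_samples, y_samples)
    return model, X_samples, y_samples, results
```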
Considering the differences in reflection characteristics of the seismic imaging, the possible seismic facies are classified into 9 categories. Some of the features are low-correlation noise data and need to be removed by feature selection so as to improve model training efficiency and generalization capability.
Because the LightGBM algorithm adopted by the invention is a gradient boosting decision tree model, an embedded feature selection approach is chosen: feature importance is calculated from the contribution rate of features in the tree model structure, and features with relatively high importance are preferentially selected.
In the LightGBM feature importance calculation, the global contribution rate of feature j is measured by the average of its contribution rates in the individual trees:
J_j^2 = (1/M) · Σ_{m=1}^{M} J_j^2(T_m)
where T_m denotes the m-th decision tree and M is the number of decision trees. The contribution rate of feature j in a single tree is given by:
J_j^2(T) = Σ_{t=1}^{L-1} i_t^2 · 1(v_t = j)
where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes, v_t is the feature used to split node t, and i_t^2 is the reduction in squared loss after node t splits.
The experimental environment for the model construction of the invention is shown in Table 1 below:
Table 1 Experimental environment configuration
First, the seismic facies classification data set is divided into a training set and a test set at a ratio of 6:4; the training set is used for the training process of the LightGBM model, and the test set is used to evaluate the performance of the model. Second, the LightGBM model hyperparameters are tuned by cross-validation (CV). Third, a prediction model is fitted on the training set with the optimal hyperparameter combination, and the performance of the model is comprehensively evaluated on the test set data to obtain the optimal model for the current task.
(1) Hyperparameter optimization
In machine learning, a hyperparameter is a parameter used to control the learning process; it is usually determined before learning, unlike other parameters that are derived through training. In machine learning model construction, hyperparameter values typically need to be adjusted for the data set, and common methods include grid search, Bayesian optimization, heuristic search, and random search. Because random search is more efficient in multi-hyperparameter optimization tasks, this work determines the optimal hyperparameter combination based on that method.
Meanwhile, to objectively judge how well the trained parameters generalize to data outside the training set, 5-fold cross-validation is generally adopted for hyperparameter configuration. As shown in FIG. 7, the original training set is randomly divided into 5 subsets of equal size; 1 subset serves as the validation subset, the other 4 serve as training subsets, and a model search is performed on this basis. The above process is repeated 5 times until every subset has served as the validation set, and finally the optimal hyperparameter combination is determined from the average accuracy of the 5 parameter searches. The results of this experiment are shown in Table 2:
Table 2 Optimal hyperparameter combination
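A hedged sketch of such a search is shown below, using scikit-learn's RandomizedSearchCV with 5-fold cross-validation over a LightGBM classifier. The parameter ranges and iteration count are illustrative assumptions, not the values actually used to produce Table 2, and X_train and y_train are assumed to come from the earlier split.

```python
import lightgbm as lgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space over common LightGBM hyperparameters.
param_distributions = {
    "num_leaves": randint(15, 128),
    "max_depth": randint(3, 12),
    "learning_rate": uniform(0.01, 0.2),
    "n_estimators": randint(100, 600),
    "min_child_samples": randint(10, 60),
    "colsample_bytree": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    estimator=lgb.LGBMClassifier(objective="multiclass"),
    param_distributions=param_distributions,
    n_iter=50,                  # number of random parameter combinations
    scoring="neg_log_loss",     # consistent with the multi-logloss metric
    cv=5,                       # 5-fold cross-validation as in FIG. 7
    random_state=0,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)
print("mean CV score:", search.best_score_)
```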
(2) Model evaluation
In general, the process of machine learning model construction is actually a process of minimizing a loss function through parameter adjustment. In multi-class classification problems, the multi-class logarithmic loss (multi-logloss) function is generally selected as the standard for measuring the predictive capability of the model; the loss function gradually converges as the model is optimized (as shown in FIG. 4), and when it has converged and no longer decreases, model optimization is complete. The multi-logloss is defined as:
logloss = -(1/n) · Σ_{i=1}^{n} Σ_{j=1}^{m} y_{i,j} · log(p_{i,j})
where n is the number of predicted samples; m is the number of classes; y_{i,j} indicates the true class, with y_{i,j} = 1 if the i-th sample belongs to the j-th class; and p_{i,j} is the probability that the i-th sample belongs to the j-th class in the model prediction result.
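The short sketch below computes this multi-class logarithmic loss directly with NumPy on a toy example, purely to illustrate the formula above; the probabilities are arbitrary assumed values.

```python
import numpy as np

def multi_logloss(y_true, proba, eps=1e-15):
    """Multi-class log loss: -(1/n) * sum_i sum_j y_ij * log(p_ij)."""
    n, m = proba.shape
    proba = np.clip(proba, eps, 1.0)      # avoid log(0)
    y_onehot = np.eye(m)[y_true]          # y_ij = 1 iff sample i is class j
    return -np.mean(np.sum(y_onehot * np.log(proba), axis=1))

# Toy example with n = 3 samples and m = 3 classes.
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
print(multi_logloss(y_true, proba))       # approximately 0.520
```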
Experimental results
The experimental data are publicly available SEG test data for seismic facies identification and contain several typical seismic facies. The workflow of the method is shown in FIG. 1. On the training set of the seismic imaging data, seismic facies classification labels are created based on expert experience and knowledge; the data set is then preprocessed and expanded to prepare the data set and labels.
The seismic imaging data and label data set are split into a training set and a test set at a ratio of 6:4; the training set is used for the training process of the LightGBM model, and the test set is used to evaluate the performance of the model.
The LightGBM model hyperparameters are tuned by cross-validation (CV). A prediction model is fitted on the training set using the optimal hyperparameter combination, and the performance of the model is comprehensively evaluated on the test set data to obtain the optimal classification model for the current task.
The data set for verification (shown in FIG. 5) is input into the prediction model to obtain the automatic seismic facies classification result (shown in FIG. 6), completing the automatic classification of seismic facies based on machine learning.
In the present invention, the related art used is as follows:
1. Principle of the lightweight gradient boosting algorithm (LightGBM)
1.1 Decision tree algorithm
The LightGBM method is essentially developed from the traditional decision tree method, so the basic principles of decision trees are introduced before the LightGBM method itself.
A Decision Tree is a classical tree-structured machine learning algorithm. It is considered an effective tool for solving classification problems because of advantages such as high computational efficiency and strong interpretability. In general, a decision tree contains a root node, a number of child nodes and leaf nodes. The root node and each child node correspond to an attribute test, and samples are divided into the child nodes of the next layer according to the test result. This recursion continues until the samples in a node all belong to the same class or can no longer be divided; such nodes are called leaf nodes and each corresponds to a decision result. A decision tree model is shown in FIG. 8, where x_i represents the attributes, a, b and c represent the decision thresholds of the attributes, and A, B, C and D represent different decision results.
In summary, the construction of a decision tree model is a top-down process, and mainly comprises two steps: feature selection and decision tree pruning.
The attribute feature selection step is as follows:
Decision tree growth is the process of splitting leaf nodes to generate new leaf nodes, where a split partitions the sample data set in the current node. Whether suitable features can be efficiently selected as splitting attributes is an important criterion for measuring the quality of a decision tree algorithm. At present, decision tree algorithms mainly use entropy and the Gini coefficient as classification indices to judge the degree of impurity within a feature, so as to find the growth direction that raises the purity of the samples in the nodes the most. Impurity and purity are two opposite concepts; both describe the number of states (sample classes) in a system (set). The greater the number of sample classes, the higher the impurity and the lower the purity, and vice versa.
Assuming the discrete random variable X obeys the following probability distribution, where p_i denotes the probability of the value x_i:
P(X = x_i) = p_i, i = 1, 2, …, n
the entropy of X is defined as:
H(X) = -Σ_{i=1}^{n} p_i · log(p_i)
and the Gini coefficient is calculated as:
Gini(X) = Σ_{i=1}^{n} p_i · (1 - p_i) = 1 - Σ_{i=1}^{n} p_i^2
In summary, entropy and the Gini coefficient behave similarly and can both be used as measures of the impurity of the feature data: the fewer the classes, i.e., the higher the data purity, the lower the entropy and Gini coefficient; the more classes, i.e., the lower the data purity, the higher the entropy and Gini coefficient.
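As a concrete illustration of these two impurity measures, the short sketch below computes the entropy (base-2 logarithm) and Gini coefficient of a class-probability distribution; the example distributions are arbitrary assumptions.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_i p_i * log2(p_i), ignoring zero-probability classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini(X) = 1 - sum_i p_i ** 2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

pure = [1.0, 0.0, 0.0]         # a single class: lowest impurity
mixed = [1 / 3, 1 / 3, 1 / 3]  # three equally likely classes: highest impurity
print(entropy(pure), gini(pure))    # 0.0  0.0
print(entropy(mixed), gini(mixed))  # ~1.585  ~0.667
```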
The decision tree pruning step is as follows:
During machine learning, overfitting often occurs: the objective function depends excessively on the training sample set and even fits every sample (including noise), so the model performs well only on the training set and cannot correctly predict unknown samples. If a model is constructed only through feature selection, overfitting will occur during model generation, so the decision tree model needs pruning.
Pruning of decision trees is generally divided into pre-pruning and post-pruning. Pre-pruning constrains tree construction with preset thresholds such as the maximum decision tree depth and the minimum sample count, and splitting stops once the training process reaches a threshold condition. Post-pruning first generates a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up: if replacing the subtree rooted at a node with a leaf node improves generalization performance, the subtree is replaced with that leaf node.
1.2 Ensemble learning
Ensemble Learning refers to performing a learning task by constructing and combining multiple learners according to a certain integration strategy (as shown in FIG. 9). The individual learners are basic classification models of the same type, such as neural networks, decision trees or other traditional supervised classifiers, and are also called "base learners". By combining multiple base learners, an algorithm model with better predictive performance than a single learner can be obtained. According to the integration strategy, ensemble learning can currently be divided into two major categories: the Boosting algorithm and the Bagging algorithm.
(1)Boosting
The Boosting integration strategy generates a strong learner from multiple weak learners in a serial manner. A weak learner performs only slightly better than random guessing, while a strong learner has accurate predictive capability. The basic idea of Boosting is to stack multiple base learners layer by layer and connect them serially; the sample distribution used to train each base learner is determined by the classification results of the previous one: misclassified samples are given larger weights and receive more attention in the next round of training. This process is repeated until the number of base learners reaches a preset number T, and the T base learners are then combined in series to obtain the strong learner. In summary, the Boosting strategy focuses on reducing the bias of the ensemble learner, and a strong learner can be realized by integrating multiple weak learners.
(2)Bagging
Different from the serial generation of Boosting, Bagging trains in a parallel manner, and each base classifier is trained on samples drawn from the same training set, so there is no strong dependency among the base classifiers. However, because of differences in learning ability between the classifiers, their classification results tend to differ, so the final prediction result is obtained by voting. From the bias-variance perspective, the Bagging method focuses on reducing the sample variance in ensemble learning; applying a Bagging strategy to algorithms that are easily disturbed by sample perturbations usually improves the prediction effect significantly.
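To make the two integration strategies concrete, the sketch below trains a Bagging ensemble and a Boosting ensemble of decision trees with scikit-learn on toy data. The use of BaggingClassifier and AdaBoostClassifier here is only an illustrative assumption, since the invention itself relies on LightGBM's gradient boosting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Toy multi-class data standing in for the seismic facies features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Bagging: decision trees trained in parallel, predictions combined by voting.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: decision trees trained serially, misclassified samples re-weighted.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, clf in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, "mean CV accuracy:", scores.mean())
```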
1.3 Gradient boosting decision tree
The Gradient Boosting Decision Tree (GBDT) algorithm is an iterative decision tree algorithm that uses the ensemble-learning Boosting concept. Generally, GBDT is a decision-tree-based classifier implemented by using the negative gradient of the loss function as an approximation of the boosting tree residual.
The boosting tree f_M(x) is described mathematically as:
f_M(x) = Σ_{m=1}^{M} γ_m · T_m(x)
where T_m(x) is a weak learner, i.e., a decision tree; γ_m is the best-fit weight of each weak learner; and M is the number of trees, i.e., the number of iterations.
Model training is the process of minimizing the loss function L. Assuming the training sample size is N and the variables and true value of the i-th sample are x_i and y_i respectively, the objective function for parameter tuning is:
f* = arg min_f Σ_{i=1}^{N} L(y_i, f(x_i))
where f* denotes the prediction model obtained when training is complete, and L is the loss function used during training.
With reference to FIG. 10, the GBDT algorithm flow is summarized as follows:
A weak learner is initialized to obtain the initial prediction model f_0(x):
f_0(x) = arg min_γ Σ_{i=1}^{N} L(y_i, γ)
where L is the loss function and γ is the weak learner model weight.
For each iteration m = 1, 2, …, M, the negative gradient, i.e., the residual r_{im}, is calculated:
r_{im} = -[∂L(y_i, f(x_i)) / ∂f(x_i)], evaluated at f = f_{m-1}
The pairs (x_i, r_{im}) are used as the training data of the next decision tree, which is fitted to obtain a new decision tree f_m(x). The corresponding set of leaf node regions is R_{jm} (j = 1, 2, …, J), where J is the number of leaf nodes. The best-fit value over each region is calculated as:
γ_{jm} = arg min_γ Σ_{x_i ∈ R_{jm}} L(y_i, f_{m-1}(x_i) + γ)
The regression tree is then updated:
f_m(x) = f_{m-1}(x) + Σ_{j=1}^{J} γ_{jm} · I(x ∈ R_{jm})
where I is an indicator function that is 1 when x belongs to the leaf node region R_{jm} and 0 otherwise.
Finally, the final model f_M(x) is output:
f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J} γ_{jm} · I(x ∈ R_{jm})
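Before turning to the multi-class case, the following is a minimal sketch of this boosting loop for the squared-error loss, where the negative gradient is simply the residual y - f(x). It is an illustrative reconstruction of the steps above using scikit-learn regression trees, not LightGBM's optimized implementation, and the learning-rate (shrinkage) factor is an extra practical detail not written in the formulas above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=100, learning_rate=0.1, max_depth=3):
    """Minimal GBDT for squared loss, following the steps above."""
    f0 = np.mean(y)                 # initial model f_0(x): constant minimizer
    pred = np.full(len(y), f0)
    trees = []
    for m in range(M):
        residual = y - pred         # negative gradient r_im for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)       # fit the next tree to (x_i, r_im)
        # The leaf outputs of the regression tree play the role of gamma_jm here.
        pred += learning_rate * tree.predict(X)   # f_m = f_{m-1} + update
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Toy regression example.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
f0, trees = gbdt_fit(X, y)
print(np.mean((gbdt_predict(X, f0, trees) - y) ** 2))   # training MSE
```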
and constructing the GBDT model according to the steps and applying the GBDT model to the process of classifying the problems, and measuring the performance of the model according to the probability distribution condition predicted by the model after the predicted result is obtained. It is common practice to employ a normalized exponential function (Softmax).
Softmax is a generalization of the logistic function to multi-class tasks; its purpose is to express multi-class results in the form of probabilities. Let D_T denote the sample training set, D_T = {(x_i, y_i), i = 1, …, n_T}, where x_i is the feature data input to the model and y_i is the corresponding class label. Assuming the training set contains K distinct classes, generally n_T > K. For the classification problem, the role of the LightGBM is to learn the mapping f: R^P → R^K between x_i and y_i, where P is the number of input features. For an input x, the model outputs a K-dimensional score vector v, which is substituted into the Softmax function to compute the classification probability values:
p_k = exp(v_k) / Σ_{j=1}^{K} exp(v_j), k = 1, …, K
where p_k denotes the predicted probability of belonging to the k-th class; as the Softmax formula shows, the predicted probabilities over all classes sum to 1 for any input x.
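A short, numerically stable implementation of this Softmax mapping is sketched below; the score vector is an arbitrary example.

```python
import numpy as np

def softmax(v):
    """p_k = exp(v_k) / sum_j exp(v_j), shifted by max(v) for stability."""
    z = np.exp(v - np.max(v))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])  # example K = 3 score vector from the model
p = softmax(scores)
print(p, p.sum())                   # probabilities sum to 1
```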
1.4 Lightweight gradient boosting algorithm
The Light Gradient Boosting Machine (LightGBM) is the core method on which the invention is based: a fast, high-performance algorithm built on the GBDT framework and released as an open-source project by Microsoft Research Asia in 2017. In GBDT, features are selected and split using a pre-sorting approach, which determines the split points precisely but consumes a large amount of memory and time. LightGBM instead employs a histogram-based approach and a leaf-wise growth strategy with depth limitation to increase training speed and reduce memory consumption.
The basic idea of the histogram algorithm adopted by LightGBM (FIG. 11) is to discretize continuous floating-point feature values into S bins ("buckets") and build a histogram on that basis. While traversing the data, these discrete values are used as indices to accumulate statistics (the number of samples in each bucket) into the corresponding histogram; all discrete values are then traversed to find the optimal split point. The advantages of the histogram algorithm mainly include: (1) improved operational efficiency and reduced computation time, lowering the time complexity from O(N) to O(S), i.e., tasks that originally required N computations can be completed with only S, where S is smaller than N; (2) reduced memory occupation, since bucketing (binning) discretizes the continuous values and, when the bin count is small, the training data can be stored with smaller data types.
The growth of decision trees in LightGBM differs from the level-wise (layer-by-layer) growth strategy (FIG. 12) adopted by other decision tree algorithms, which indiscriminately generates a whole layer of new leaf nodes when splitting; that keeps the complexity of the resulting tree model low but wastes memory on many invalid or inefficient nodes. LightGBM instead adopts a leaf-wise (leaf-by-leaf) growth strategy (FIG. 13), which selects the node with the largest gain according to entropy or the Gini coefficient, greatly improving analysis and computation efficiency. In addition, because this strategy may cause overfitting when the decision tree grows too deep, a maximum growth depth limit is added on top of the leaf-wise strategy.
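In the open-source LightGBM implementation, the histogram bin count, the leaf-wise leaf budget, and the depth limit discussed above are exposed as the max_bin, num_leaves, and max_depth parameters. The values below are illustrative assumptions, not the tuned values of this invention, and the commented training call shows where the prepared facies data set would be plugged in.

```python
import lightgbm as lgb

params = {
    "objective": "multiclass",
    "num_class": 9,          # 9 seismic facies classes
    "metric": "multi_logloss",
    "max_bin": 255,          # S bins used by the histogram algorithm
    "num_leaves": 63,        # leaf-wise growth: budget of leaves per tree
    "max_depth": 8,          # depth limit added to curb overfitting
    "learning_rate": 0.05,
}

# A training run would build a Dataset from the prepared facies data, e.g.:
# train_set = lgb.Dataset(X_train, label=y_train)
# booster = lgb.train(params, train_set, num_boost_round=300)
```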
1.5 model feature selection
Referring to FIG. 14, which is a schematic diagram of the feature selection technique in the invention: feature selection, also called attribute selection, is the process of choosing, according to some criterion, n (n < m) features suitable for model construction out of the m features in a data set. In application scenarios involving multidimensional data such as natural language recognition, medical gene diagnosis and remote sensing image processing, not every feature contributes to model construction; a large number of redundant features easily leads to the curse of dimensionality, multiplies the complexity of the model, and seriously affects the efficiency and accuracy of model construction.
According to how they interact with the classifier, feature selection methods are mainly divided into 3 types: filter, wrapper and embedded.
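The sketch below illustrates the embedded approach described above: gain-based feature importances are read from a fitted LightGBM model and low-contribution features are dropped with scikit-learn's SelectFromModel. The median threshold is an illustrative assumption, and X_train and y_train are assumed to come from the earlier split.

```python
import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel

# X_train, y_train: the training split from the earlier sketch (assumption).
model = lgb.LGBMClassifier(importance_type="gain")
model.fit(X_train, y_train)

# Gain-based importance: total split gain contributed by each feature over all trees.
for name, score in zip(model.feature_name_, model.feature_importances_):
    print(name, round(float(score), 2))

# Embedded selection: keep only features whose importance exceeds the median.
selector = SelectFromModel(model, threshold="median", prefit=True)
X_train_reduced = selector.transform(X_train)
print(X_train.shape, "->", X_train_reduced.shape)
```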
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims.
Claims (5)
1. A rapid seismic facies identification method based on the LightGBM algorithm, characterized by comprising the following steps:
S1, creating seismic facies classification labels for a seismic imaging data training set, and then preprocessing and expanding the data set to complete the preparation of the data set and labels;
S2, splitting the seismic imaging data and label data set into a training set and a verification set at a ratio of 6:4, wherein the training set is used to train the LightGBM model and the verification set is used to evaluate the performance of the model;
S3, tuning the hyperparameters of the LightGBM model by cross-validation, fitting a prediction model on the training set with the optimal hyperparameter combination, and comprehensively evaluating the performance of the model with the verification set data to obtain the optimal classification model for the current task;
S4, inputting the data set for verification into the prediction model to obtain the automatic seismic facies classification result, completing the automatic classification of seismic facies.
2. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 1, characterized in that in step S1, expanding the data set comprises the following steps:
S11, classifying the data to be identified by manual interpretation in combination with existing seismic interpretation methods;
S12, selecting the maximum probability value and comparing it with a preset probability threshold λ; if it is not smaller than λ, outputting the seismic facies class corresponding to the maximum probability value as the recognition result, otherwise performing data expansion;
S13, labeling the data to be identified with the classification result and then adding it to the sample data set;
S14, maintaining the original model with the new sample data so as to realize the identification of new seismic facies types.
3. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 2, characterized in that in step S1, when the seismic facies classification labels are created, the seismic facies are classified into 9 classes, and noise data with low correlation are removed by feature selection so as to improve the training efficiency and generalization capability of the model.
4. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 3, characterized in that the LightGBM algorithm is a gradient boosting decision tree model, the feature selection adopts an embedded feature selection approach, feature importance is calculated from the contribution rate of features in the tree model structure, and features with relatively high importance are preferentially selected.
5. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 4, characterized in that in the LightGBM feature importance calculation, the global contribution rate of feature j is measured by the average of its contribution rates in the individual trees:
J_j^2 = (1/M) · Σ_{m=1}^{M} J_j^2(T_m)
where T_m denotes the m-th decision tree and M is the number of decision trees; the contribution rate of feature j in a single tree is given by:
J_j^2(T) = Σ_{t=1}^{L-1} i_t^2 · 1(v_t = j)
where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes, v_t is the feature used to split node t, and i_t^2 is the reduction in squared loss after node t splits.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311804932.7A CN117763356A (en) | 2023-12-26 | 2023-12-26 | Rapid earthquake phase identification method based on LightGBM algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311804932.7A CN117763356A (en) | 2023-12-26 | 2023-12-26 | Rapid earthquake phase identification method based on LightGBM algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117763356A true CN117763356A (en) | 2024-03-26 |
Family
ID=90323534
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311804932.7A Pending CN117763356A (en) | 2023-12-26 | 2023-12-26 | Rapid earthquake phase identification method based on LightGBM algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117763356A (en) |
Patent Citations (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11169288B1 (en) * | 2017-12-07 | 2021-11-09 | Triad National Security, Llc | Failure prediction and estimation of failure parameters |
CN110889308A (en) * | 2018-09-07 | 2020-03-17 | 中国石油化工股份有限公司 | Earthquake seismographic first arrival identification method and system based on machine learning |
CN110413494A (en) * | 2019-06-19 | 2019-11-05 | 浙江工业大学 | A LightGBM Fault Diagnosis Method Based on Improved Bayesian Optimization |
DE202020101012U1 (en) * | 2020-02-25 | 2020-03-08 | Robert Bosch Gmbh | Device for predicting a suitable configuration of a machine learning system for a training data set |
CN111310860A (en) * | 2020-03-26 | 2020-06-19 | 清华大学深圳国际研究生院 | Method and computer-readable storage medium for improving performance of gradient boosting decision trees |
AU2020100630A4 (en) * | 2020-04-24 | 2020-06-04 | Kaplan, Umit Emrah MR | System and method for grade estimation using gradient boosted decesion tree based machine learning algorithims |
US20210350274A1 (en) * | 2020-05-07 | 2021-11-11 | International Business Machines Corporation | Dataset management in machine learning |
CN111999765A (en) * | 2020-08-14 | 2020-11-27 | 广西大学 | Microseismic multi-precursor method and device for early warning of instability of falling karst dangerous rock |
WO2022053147A1 (en) * | 2020-09-11 | 2022-03-17 | Swiss Reinsurance Company Ltd. | Mobile device and system for identifying and/or classifying occupants of a vehicle and corresponding method thereof |
WO2022088979A1 (en) * | 2020-10-26 | 2022-05-05 | 四川大学华西医院 | Method for accelerating system evaluation updating by integrating a plurality of bert models by lightgbm |
CN112529112A (en) * | 2020-12-29 | 2021-03-19 | 中国地质科学院地质力学研究所 | Mineral identification method and device |
WO2022205768A1 (en) * | 2021-04-02 | 2022-10-06 | 四川大学华西医院 | Random contrast test identification method for integrating multiple bert models on the basis of lightgbm |
CN114153976A (en) * | 2021-12-10 | 2022-03-08 | 华南理工大学 | Traffic incident classification method, system and medium based on social media data |
US20230222397A1 (en) * | 2022-01-07 | 2023-07-13 | Saudi Arabian Oil Company | Method for automated ensemble machine learning using hyperparameter optimization |
WO2023137434A1 (en) * | 2022-01-13 | 2023-07-20 | Schlumberger Technology Corporation | Reflection seismology inversion with quality control |
CN114067092A (en) * | 2022-01-17 | 2022-02-18 | 山东药品食品职业学院 | Fatty liver B-mode ultrasound image classification method based on DenseNet and lightGBM |
US11527786B1 (en) * | 2022-03-28 | 2022-12-13 | Eatron Technologies Ltd. | Systems and methods for predicting remaining useful life in batteries and assets |
WO2023197612A1 (en) * | 2022-04-15 | 2023-10-19 | 湖南大学 | Automatic data augmentation-based medical image segmentation method |
CN114757285A (en) * | 2022-04-18 | 2022-07-15 | 广西师范大学 | A Trusted Federated Gradient Boosting Decision Tree Training Method Based on Trusted Incentives |
CN114676932A (en) * | 2022-04-18 | 2022-06-28 | 工银瑞信基金管理有限公司 | Bond default prediction method and device based on class imbalance machine learning framework |
CN115099266A (en) * | 2022-05-31 | 2022-09-23 | 上海工程技术大学 | Hard vehicle surface white layer prediction method based on gradient lifting decision tree |
CN115050477A (en) * | 2022-06-21 | 2022-09-13 | 河南科技大学 | Bayesian optimization based RF and LightGBM disease prediction method |
CN115308799A (en) * | 2022-09-05 | 2022-11-08 | 中国地质科学院地质力学研究所 | Seismic imaging free gas structure identification method and system |
CN115631739A (en) * | 2022-10-12 | 2023-01-20 | 广州蓝深科技有限公司 | Music chord identification method based on LightGBM algorithm |
CN115759435A (en) * | 2022-11-24 | 2023-03-07 | 辽宁东科电力有限公司 | Photovoltaic power generation power prediction method based on improved CNN-LSTM |
CN116341728A (en) * | 2023-03-16 | 2023-06-27 | 电子科技大学 | Ultra-short-term photovoltaic output power prediction method based on data driving |
CN116756679A (en) * | 2023-05-19 | 2023-09-15 | 中法渤海地质服务有限公司 | Multi-source information fusion-based method for judging geological mode of down-the-hole mountain |
CN117035151A (en) * | 2023-06-25 | 2023-11-10 | 西安石油大学 | Unstable water injection working system optimization method and system based on lightGBM algorithm |
CN116609852A (en) * | 2023-07-06 | 2023-08-18 | 中国石油大学(华东) | A high-precision modeling method and equipment for underground medium parameters based on well-seismic fusion |
Non-Patent Citations (2)
Title |
---|
- Li Yuan: "Wavelet Transform and Its Engineering Applications", Beijing University of Posts and Telecommunications Press, 30 April 2010, pages 68-69 *
- Yan Xingyu; Gu Hanming; Xiao Yifei; Ren Hao; Ni Jun: "Application of the XGBoost algorithm in well-log interpretation of tight sandstone gas reservoirs", Oil Geophysical Prospecting, no. 02, 15 April 2019 (2019-04-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |