CN117763356A - Rapid earthquake phase identification method based on LightGBM algorithm - Google Patents
Rapid earthquake phase identification method based on LightGBM algorithm
- Publication number
- CN117763356A CN117763356A CN202311804932.7A CN202311804932A CN117763356A CN 117763356 A CN117763356 A CN 117763356A CN 202311804932 A CN202311804932 A CN 202311804932A CN 117763356 A CN117763356 A CN 117763356A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention relates to the technical field of seismic facies identification, and in particular to a rapid seismic facies identification method based on the LightGBM algorithm, comprising the following steps: S1, creating seismic facies classification labels for a seismic imaging data training set, then preprocessing and expanding the data set to complete the preparation of the data set and labels; S2, splitting the seismic imaging data and label data set into a training set and a prediction set at a ratio of 6:4, where the training set is used to train the LightGBM model and the prediction set is used to evaluate the performance of the model.
Description
Technical Field
The invention relates to the technical field of seismic facies identification, in particular to a rapid seismic facies identification method based on a LightGBM algorithm.
Background
In the exploration and development of subsurface resources such as oil and gas, coal, and brine minerals, the study of the depositional environment is of great significance.
Conventionally, sedimentary facies are studied using cores or outcrops. However, in the vast areas without outcrops, the sedimentary facies of a target interval can only be observed from core data. Because the coring rate of a well is low, generally a few percent to a few tens of percent, and drilling coring is not continuous, even sufficient and accurate single-well facies analysis can hardly reflect the spatial distribution characteristics of sedimentary facies across a region. Characterizing the sedimentary facies distribution of a work area would require a sufficiently dense well pattern, which is very difficult to achieve in actual production. A new technique is therefore needed that can better obtain the regional distribution characteristics of sedimentary facies from only a small amount of drilling core data, and seismic facies analysis technology emerged to solve this problem.
As exploration deepens, the energy industry faces new situations and challenges, which require seismic facies analysis technology to develop toward higher efficiency and finer resolution. The qualitative or semi-quantitative analysis of manually guided seismic facies in the prior art suffers from the subjectivity and unreliability of manual experience, so the reliability of the data analysis is insufficient to meet current complex production and development requirements.
Disclosure of Invention
The invention aims to provide a rapid seismic facies identification method based on the LightGBM algorithm to solve the problems described in the background art.
The aim of the invention can be achieved by the following technical scheme:
A rapid seismic facies identification method based on the LightGBM algorithm comprises the following steps:
S1, creating seismic facies classification labels for a seismic imaging data training set, and then preprocessing and expanding the data set to complete the preparation of the data set and labels;
S2, splitting the seismic imaging data and label data set into a training set and a prediction set at a ratio of 6:4, wherein the training set is used to train the LightGBM model and the prediction set is used to evaluate the performance of the model;
S3, tuning the hyperparameters of the LightGBM model by cross-validation, fitting a prediction model on the training set with the optimal hyperparameter combination, and comprehensively evaluating the performance of the model with the prediction set data to obtain the optimal classification model for the current task;
S4, inputting the data set for verification into the prediction model to obtain the automatic seismic facies classification result, completing the automatic classification of seismic facies.
Preferably, in step S1, expanding the data set comprises the following steps:
S11, classifying the data to be identified by manual interpretation in combination with existing seismic interpretation methods;
S12, selecting the maximum probability value and comparing it with a preset probability threshold λ; if it is not smaller than λ, outputting the seismic facies class corresponding to the maximum probability value as the recognition result, otherwise performing data expansion;
S13, labeling the data to be identified with the classification result and then adding it to the sample data set;
S14, maintaining the original model with the new sample data so as to realize the identification of new seismic facies types.
Preferably, in step S1, when the seismic facies classification labels are created, the seismic facies are classified into 9 classes, and noise data with low correlation are removed by feature selection so as to improve model training efficiency and generalization capability.
Preferably, the LightGBM algorithm is a gradient boosting decision tree model, the feature selection adopts an embedded feature selection approach, feature importance is calculated from the contribution rate of features in the tree model structure, and features with relatively high importance are preferentially selected.
Preferably, in the LightGBM feature importance calculation, the global contribution rate of feature j is measured by the average of its contribution rates in the individual trees:
J_j^2 = (1/M) · Σ_{m=1}^{M} J_j^2(T_m)
where T_m denotes the m-th decision tree and M is the number of decision trees. The contribution rate of feature j in a single tree is given by formula 3-4:
J_j^2(T) = Σ_{t=1}^{L-1} i_t^2 · 1(v_t = j)
where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes, v_t is the feature used to split node t, and i_t^2 is the reduction in squared loss after node t splits.
The invention has the beneficial effects that:
the invention can develop the earthquake phase automatic quantitative analysis technology by means of strong computer computing capability, is beneficial to improving the working efficiency of the interpretation of petroleum geophysical prospecting data, shortening the working period and reducing the working cost, is beneficial to reducing subjectivity and unreliability of manual experience, enhancing the reliability of data analysis, and improving the capability and application effect of the petroleum industry for solving the complex exploration and development problem.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it will be obvious to those skilled in the art that other drawings can be obtained according to these drawings without inventive effort;
FIG. 1 is a schematic flow chart of the present invention;
FIG. 2 is a sample of the input training data and label data in accordance with the present invention;
FIG. 3 is the seismic facies interpretation label data corresponding to the sample data of FIG. 2 in accordance with the present invention;
FIG. 4 is a graph of training loss function change of the LightGBM model according to the invention;
FIG. 5 is a plot of experimental seismic imaging profile data for use in the present invention;
FIG. 6 is a graph of the seismic facies classification results automatically predicted using the LightGBM method of the present invention;
FIG. 7 is a schematic cross-validation diagram in accordance with the present invention;
FIG. 8 is a schematic diagram of a decision tree model in the present invention;
FIG. 9 is a schematic diagram of ensemble learning in accordance with the present invention;
FIG. 10 is a schematic illustration of a GBDT according to the present invention;
FIG. 11 is a schematic diagram of a histogram algorithm in the present invention;
FIG. 12 is a schematic diagram of a layer-by-layer growth strategy in the present invention;
FIG. 13 is a schematic of a leaf-by-leaf growth strategy in the present invention;
FIG. 14 is a schematic diagram of the feature selection technique in the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention aims to realize automatic seismic facies interpretation of seismic imaging data based on the lightweight gradient boosting algorithm (LightGBM). To achieve the above object, the invention adopts the following steps:
On the training set of the seismic imaging data (shown in FIG. 2), seismic facies classification labels (shown in FIG. 3) are created based on expert experience and knowledge, and the data set is then preprocessed and expanded to prepare the data set and labels.
The seismic imaging data and label data set are split into a training set and a test set at a ratio of 6:4; the training set is used for the training process of the LightGBM model, and the test set is used to evaluate the performance of the model.
The LightGBM model hyperparameters are tuned by cross-validation (CV). A prediction model is fitted on the training set using the optimal hyperparameter combination, and the performance of the model is comprehensively evaluated on the test set data to obtain the optimal classification model for the current task.
The data set for verification is input into the prediction model to obtain the automatic seismic facies classification result, completing the automatic classification of seismic facies based on machine learning.
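The following is a minimal, hedged sketch of this workflow using the open-source lightgbm and scikit-learn Python packages. The feature matrix X and the integer facies labels y are random placeholders standing in for the prepared seismic imaging data and label set, and the hyperparameter values are illustrative rather than the tuned values reported later in Table 2.

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss

# Placeholder data: X holds per-sample seismic imaging features,
# y holds integer facies labels in {0, ..., 8} (9 classes).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 20))
y = rng.integers(0, 9, size=5000)

# Split into a 6:4 training / evaluation set.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=0)

# Fit a multi-class LightGBM model (illustrative hyperparameters;
# the sklearn wrapper infers the 9 classes from the labels).
model = lgb.LGBMClassifier(
    objective="multiclass",
    n_estimators=300,
    learning_rate=0.05,
    num_leaves=31,
    max_depth=8)
model.fit(X_train, y_train,
          eval_set=[(X_eval, y_eval)],
          eval_metric="multi_logloss")

# Predict facies for the held-out data and report evaluation metrics.
proba = model.predict_proba(X_eval)
pred = model.predict(X_eval)
print("accuracy:", accuracy_score(y_eval, pred))
print("multi_logloss:", log_loss(y_eval, proba))
```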
Unlike conventional one-off data set construction, the invention builds and maintains the data set gradually in a dynamic expansion mode. The main purpose of data expansion is to continuously improve the classification capability of the model and to reduce the lasting influence of insufficient early sample data on subsequent training.
First, the data to be identified are classified by expert interpretation in combination with traditional seismic interpretation methods;
Second, the maximum probability value is selected and compared with a preset probability threshold λ; if it is not smaller than λ, the seismic facies class corresponding to the maximum probability value is output as the recognition result, otherwise data expansion is performed;
Third, the data to be identified are labeled with the classification result and then added to the sample data set;
Fourth, the original model is maintained with the new sample data, so that the model gains the ability to identify new seismic facies types.
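A minimal sketch of this expansion loop is given below. The threshold value, the already-trained `model` (an LGBMClassifier as in the previous sketch), the unlabeled candidate batch `X_new`, and the placeholder `ask_expert` labeling function are assumptions introduced only for illustration.

```python
import numpy as np

LAMBDA = 0.8  # preset probability threshold (illustrative value)

def expand_dataset(model, X_new, X_samples, y_samples, ask_expert):
    """Dynamic expansion steps for a batch of unlabeled data X_new.

    ask_expert is a placeholder callable that returns an expert-assigned
    facies label for a single sample (the manual-interpretation step).
    """
    proba = model.predict_proba(X_new)       # class probabilities per sample
    max_p = proba.max(axis=1)                # maximum probability value
    auto_label = proba.argmax(axis=1)        # facies class of that maximum

    results = []
    for i in range(len(X_new)):
        if max_p[i] >= LAMBDA:
            # Confident prediction: output the facies class directly.
            results.append(int(auto_label[i]))
        else:
            # Low confidence: fall back to expert interpretation and expand.
            label = int(ask_expert(X_new[i]))
            results.append(label)
            X_samples = np.vstack([X_samples, X_new[i:i + 1]])
            y_samples = np.append(y_samples, label)

    # Maintain the original model with the enlarged sample data set.
    model.fit(X_samples, y_samples)
    return model, X_samples, y_samples, results
```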
Considering the differences in reflection characteristics of the seismic imaging, the possible seismic facies are classified into 9 categories. Some of the features are low-correlation noise data and need to be removed by feature selection so as to improve model training efficiency and generalization capability.
Because the LightGBM algorithm adopted by the invention is a gradient boosting decision tree model, an embedded feature selection approach is chosen: feature importance is calculated from the contribution rate of features in the tree model structure, and features with relatively high importance are preferentially selected.
In the LightGBM feature importance calculation, the global contribution rate of feature j is measured by the average of its contribution rates in the individual trees:
J_j^2 = (1/M) · Σ_{m=1}^{M} J_j^2(T_m)
where T_m denotes the m-th decision tree and M is the number of decision trees. The contribution rate of feature j in a single tree is given by:
J_j^2(T) = Σ_{t=1}^{L-1} i_t^2 · 1(v_t = j)
where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes, v_t is the feature used to split node t, and i_t^2 is the reduction in squared loss after node t splits.
The experimental environment for the model construction of the invention is shown in Table 1 below:
Table 1 Experimental environment configuration
First, the seismic facies classification data set is divided into a training set and a test set at a ratio of 6:4; the training set is used for the training process of the LightGBM model, and the test set is used to evaluate the performance of the model. Second, the LightGBM model hyperparameters are tuned by cross-validation (CV). Third, a prediction model is fitted on the training set with the optimal hyperparameter combination, and the performance of the model is comprehensively evaluated on the test set data to obtain the optimal model for the current task.
(1) Hyperparameter optimization
In machine learning, a hyperparameter is a parameter used to control the learning process; it is usually determined before learning, unlike other parameters that are derived through training. In machine learning model construction, hyperparameter values typically need to be adjusted for the data set, and common methods include grid search, Bayesian optimization, heuristic search, and random search. Because random search is more efficient in multi-hyperparameter optimization tasks, this work determines the optimal hyperparameter combination based on that method.
Meanwhile, to objectively judge how well the trained parameters generalize to data outside the training set, 5-fold cross-validation is generally adopted for hyperparameter configuration. As shown in FIG. 7, the original training set is randomly divided into 5 subsets of equal size; 1 subset serves as the validation subset, the other 4 serve as training subsets, and a model search is performed on this basis. The above process is repeated 5 times until every subset has served as the validation set, and finally the optimal hyperparameter combination is determined from the average accuracy of the 5 parameter searches. The results of this experiment are shown in Table 2:
Table 2 Optimal hyperparameter combination
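A hedged sketch of such a search is shown below, using scikit-learn's RandomizedSearchCV with 5-fold cross-validation over a LightGBM classifier. The parameter ranges and iteration count are illustrative assumptions, not the values actually used to produce Table 2, and X_train and y_train are assumed to come from the earlier split.

```python
import lightgbm as lgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space over common LightGBM hyperparameters.
param_distributions = {
    "num_leaves": randint(15, 128),
    "max_depth": randint(3, 12),
    "learning_rate": uniform(0.01, 0.2),
    "n_estimators": randint(100, 600),
    "min_child_samples": randint(10, 60),
    "colsample_bytree": uniform(0.6, 0.4),
}

search = RandomizedSearchCV(
    estimator=lgb.LGBMClassifier(objective="multiclass"),
    param_distributions=param_distributions,
    n_iter=50,                  # number of random parameter combinations
    scoring="neg_log_loss",     # consistent with the multi-logloss metric
    cv=5,                       # 5-fold cross-validation as in FIG. 7
    random_state=0,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("best hyperparameters:", search.best_params_)
print("mean CV score:", search.best_score_)
```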
(2) Model evaluation
In general, the process of machine learning model construction is actually a process of minimizing a loss function through parameter adjustment. In multi-class classification problems, the multi-class logarithmic loss (multi-logloss) function is generally selected as the standard for measuring the predictive capability of the model; the loss function gradually converges as the model is optimized (as shown in FIG. 4), and when it has converged and no longer decreases, model optimization is complete. The multi-logloss is defined as:
logloss = -(1/n) · Σ_{i=1}^{n} Σ_{j=1}^{m} y_{i,j} · log(p_{i,j})
where n is the number of predicted samples; m is the number of classes; y_{i,j} indicates the true class, with y_{i,j} = 1 if the i-th sample belongs to the j-th class; and p_{i,j} is the probability that the i-th sample belongs to the j-th class in the model prediction result.
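The short sketch below computes this multi-class logarithmic loss directly with NumPy on a toy example, purely to illustrate the formula above; the probabilities are arbitrary assumed values.

```python
import numpy as np

def multi_logloss(y_true, proba, eps=1e-15):
    """Multi-class log loss: -(1/n) * sum_i sum_j y_ij * log(p_ij)."""
    n, m = proba.shape
    proba = np.clip(proba, eps, 1.0)      # avoid log(0)
    y_onehot = np.eye(m)[y_true]          # y_ij = 1 iff sample i is class j
    return -np.mean(np.sum(y_onehot * np.log(proba), axis=1))

# Toy example with n = 3 samples and m = 3 classes.
y_true = np.array([0, 2, 1])
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.3, 0.6],
                  [0.2, 0.5, 0.3]])
print(multi_logloss(y_true, proba))       # approximately 0.520
```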
Experimental results
The experimental data are publicly available SEG test data for seismic facies identification and contain several typical seismic facies. The workflow of the method is shown in FIG. 1. On the training set of the seismic imaging data, seismic facies classification labels are created based on expert experience and knowledge; the data set is then preprocessed and expanded to prepare the data set and labels.
The seismic imaging data and label data set are split into a training set and a test set at a ratio of 6:4; the training set is used for the training process of the LightGBM model, and the test set is used to evaluate the performance of the model.
The LightGBM model hyperparameters are tuned by cross-validation (CV). A prediction model is fitted on the training set using the optimal hyperparameter combination, and the performance of the model is comprehensively evaluated on the test set data to obtain the optimal classification model for the current task.
The data set for verification (shown in FIG. 5) is input into the prediction model to obtain the automatic seismic facies classification result (shown in FIG. 6), completing the automatic classification of seismic facies based on machine learning.
In the present invention, the related art used is as follows:
1. Principle of the lightweight gradient boosting algorithm (LightGBM)
1.1 Decision tree algorithm
The LightGBM method is essentially developed from the traditional decision tree method, so the basic principles of decision trees are introduced before the LightGBM method itself.
A Decision Tree is a classical tree-structured machine learning algorithm. It is considered an effective tool for solving classification problems because of advantages such as high computational efficiency and strong interpretability. In general, a decision tree contains a root node, a number of child nodes and leaf nodes. The root node and each child node correspond to an attribute test, and samples are divided into the child nodes of the next layer according to the test result. This recursion continues until the samples in a node all belong to the same class or can no longer be divided; such nodes are called leaf nodes and each corresponds to a decision result. A decision tree model is shown in FIG. 8, where x_i represents the attributes, a, b and c represent the decision thresholds of the attributes, and A, B, C and D represent different decision results.
In summary, the construction of a decision tree model is a top-down process, and mainly comprises two steps: feature selection and decision tree pruning.
The attribute feature selection step is as follows:
Decision tree growth is the process of splitting leaf nodes to generate new leaf nodes, where a split partitions the sample data set in the current node. Whether suitable features can be efficiently selected as splitting attributes is an important criterion for measuring the quality of a decision tree algorithm. At present, decision tree algorithms mainly use entropy and the Gini coefficient as classification indices to judge the degree of impurity within a feature, so as to find the growth direction that raises the purity of the samples in the nodes the most. Impurity and purity are two opposite concepts; both describe the number of states (sample classes) in a system (set). The greater the number of sample classes, the higher the impurity and the lower the purity, and vice versa.
Assuming the discrete random variable X obeys the following probability distribution, where p_i denotes the probability of the value x_i:
P(X = x_i) = p_i, i = 1, 2, …, n
the entropy of X is defined as:
H(X) = -Σ_{i=1}^{n} p_i · log(p_i)
and the Gini coefficient is calculated as:
Gini(X) = Σ_{i=1}^{n} p_i · (1 - p_i) = 1 - Σ_{i=1}^{n} p_i^2
In summary, entropy and the Gini coefficient behave similarly and can both be used as measures of the impurity of the feature data: the fewer the classes, i.e., the higher the data purity, the lower the entropy and Gini coefficient; the more classes, i.e., the lower the data purity, the higher the entropy and Gini coefficient.
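As a concrete illustration of these two impurity measures, the short sketch below computes the entropy (base-2 logarithm) and Gini coefficient of a class-probability distribution; the example distributions are arbitrary assumptions.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_i p_i * log2(p_i), ignoring zero-probability classes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini(X) = 1 - sum_i p_i ** 2."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

pure = [1.0, 0.0, 0.0]         # a single class: lowest impurity
mixed = [1 / 3, 1 / 3, 1 / 3]  # three equally likely classes: highest impurity
print(entropy(pure), gini(pure))    # 0.0  0.0
print(entropy(mixed), gini(mixed))  # ~1.585  ~0.667
```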
The decision tree pruning step is as follows:
During machine learning, overfitting often occurs: the objective function depends excessively on the training sample set and even fits every sample (including noise), so the model performs well only on the training set and cannot correctly predict unknown samples. If a model is constructed only through feature selection, overfitting will occur during model generation, so the decision tree model needs pruning.
Pruning of decision trees is generally divided into pre-pruning and post-pruning. Pre-pruning constrains tree construction with preset thresholds such as the maximum decision tree depth and the minimum sample count, and splitting stops once the training process reaches a threshold condition. Post-pruning first generates a complete decision tree from the training set and then examines the non-leaf nodes from the bottom up: if replacing the subtree rooted at a node with a leaf node improves generalization performance, the subtree is replaced with that leaf node.
1.2 Ensemble learning
Ensemble Learning refers to performing a learning task by constructing and combining multiple learners according to a certain integration strategy (as shown in FIG. 9). The individual learners are basic classification models of the same type, such as neural networks, decision trees or other traditional supervised classifiers, and are also called "base learners". By combining multiple base learners, an algorithm model with better predictive performance than a single learner can be obtained. According to the integration strategy, ensemble learning can currently be divided into two major categories: the Boosting algorithm and the Bagging algorithm.
(1)Boosting
The Boosting integration strategy generates a strong learner from multiple weak learners in a serial manner. A weak learner performs only slightly better than random guessing, while a strong learner has accurate predictive capability. The basic idea of Boosting is to stack multiple base learners layer by layer and connect them serially; the sample distribution used to train each base learner is determined by the classification results of the previous one: misclassified samples are given larger weights and receive more attention in the next round of training. This process is repeated until the number of base learners reaches a preset number T, and the T base learners are then combined in series to obtain the strong learner. In summary, the Boosting strategy focuses on reducing the bias of the ensemble learner, and a strong learner can be realized by integrating multiple weak learners.
(2)Bagging
Different from the serial generation of Boosting, Bagging trains in a parallel manner, and each base classifier is trained on samples drawn from the same training set, so there is no strong dependency among the base classifiers. However, because of differences in learning ability between the classifiers, their classification results tend to differ, so the final prediction result is obtained by voting. From the bias-variance perspective, the Bagging method focuses on reducing the sample variance in ensemble learning; applying a Bagging strategy to algorithms that are easily disturbed by sample perturbations usually improves the prediction effect significantly.
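To make the two integration strategies concrete, the sketch below trains a Bagging ensemble and a Boosting ensemble of decision trees with scikit-learn on toy data. The use of BaggingClassifier and AdaBoostClassifier here is only an illustrative assumption, since the invention itself relies on LightGBM's gradient boosting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score

# Toy multi-class data standing in for the seismic facies features.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# Bagging: decision trees trained in parallel, predictions combined by voting.
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Boosting: decision trees trained serially, misclassified samples re-weighted.
boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, clf in [("bagging", bagging), ("boosting", boosting)]:
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, "mean CV accuracy:", scores.mean())
```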
1.3 Gradient boosting decision tree
The Gradient Boosting Decision Tree (GBDT) algorithm is an iterative decision tree algorithm that uses the ensemble-learning Boosting concept. Generally, GBDT is a decision-tree-based classifier implemented by using the negative gradient of the loss function as an approximation of the boosting tree residual.
The boosting tree f_M(x) is described mathematically as:
f_M(x) = Σ_{m=1}^{M} γ_m · T_m(x)
where T_m(x) is a weak learner, i.e., a decision tree; γ_m is the best-fit weight of each weak learner; and M is the number of trees, i.e., the number of iterations.
Model training is the process of minimizing the loss function L. Assuming the training sample size is N and the variables and true value of the i-th sample are x_i and y_i respectively, the objective function for parameter tuning is:
f* = arg min_f Σ_{i=1}^{N} L(y_i, f(x_i))
where f* denotes the prediction model obtained when training is complete, and L is the loss function used during training.
With reference to FIG. 10, the GBDT algorithm flow is summarized as follows:
A weak learner is initialized to obtain the initial prediction model f_0(x):
f_0(x) = arg min_γ Σ_{i=1}^{N} L(y_i, γ)
where L is the loss function and γ is the weak learner model weight.
For each iteration m = 1, 2, …, M, the negative gradient, i.e., the residual r_{im}, is calculated:
r_{im} = -[∂L(y_i, f(x_i)) / ∂f(x_i)], evaluated at f = f_{m-1}
The pairs (x_i, r_{im}) are used as the training data of the next decision tree, which is fitted to obtain a new decision tree f_m(x). The corresponding set of leaf node regions is R_{jm} (j = 1, 2, …, J), where J is the number of leaf nodes. The best-fit value over each region is calculated as:
γ_{jm} = arg min_γ Σ_{x_i ∈ R_{jm}} L(y_i, f_{m-1}(x_i) + γ)
The regression tree is then updated:
f_m(x) = f_{m-1}(x) + Σ_{j=1}^{J} γ_{jm} · I(x ∈ R_{jm})
where I is an indicator function that is 1 when x belongs to the leaf node region R_{jm} and 0 otherwise.
Finally, the final model f_M(x) is output:
f_M(x) = f_0(x) + Σ_{m=1}^{M} Σ_{j=1}^{J} γ_{jm} · I(x ∈ R_{jm})
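Before turning to the multi-class case, the following is a minimal sketch of this boosting loop for the squared-error loss, where the negative gradient is simply the residual y - f(x). It is an illustrative reconstruction of the steps above using scikit-learn regression trees, not LightGBM's optimized implementation, and the learning-rate (shrinkage) factor is an extra practical detail not written in the formulas above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, M=100, learning_rate=0.1, max_depth=3):
    """Minimal GBDT for squared loss, following the steps above."""
    f0 = np.mean(y)                 # initial model f_0(x): constant minimizer
    pred = np.full(len(y), f0)
    trees = []
    for m in range(M):
        residual = y - pred         # negative gradient r_im for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)       # fit the next tree to (x_i, r_im)
        # The leaf outputs of the regression tree play the role of gamma_jm here.
        pred += learning_rate * tree.predict(X)   # f_m = f_{m-1} + update
        trees.append(tree)
    return f0, trees

def gbdt_predict(X, f0, trees, learning_rate=0.1):
    pred = np.full(len(X), f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X)
    return pred

# Toy regression example.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
f0, trees = gbdt_fit(X, y)
print(np.mean((gbdt_predict(X, f0, trees) - y) ** 2))   # training MSE
```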
and constructing the GBDT model according to the steps and applying the GBDT model to the process of classifying the problems, and measuring the performance of the model according to the probability distribution condition predicted by the model after the predicted result is obtained. It is common practice to employ a normalized exponential function (Softmax).
Softmax is a generalization of the logistic function to multi-class tasks; its purpose is to express multi-class results in the form of probabilities. Let D_T denote the sample training set, D_T = {(x_i, y_i), i = 1, …, n_T}, where x_i is the feature data input to the model and y_i is the corresponding class label. Assuming the training set contains K distinct classes, generally n_T > K. For the classification problem, the role of the LightGBM is to learn the mapping f: R^P → R^K between x_i and y_i, where P is the number of input features. For an input x, the model outputs a K-dimensional score vector v, which is substituted into the Softmax function to compute the classification probability values:
p_k = exp(v_k) / Σ_{j=1}^{K} exp(v_j), k = 1, …, K
where p_k denotes the predicted probability of belonging to the k-th class; as the Softmax formula shows, the predicted probabilities over all classes sum to 1 for any input x.
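A short, numerically stable implementation of this Softmax mapping is sketched below; the score vector is an arbitrary example.

```python
import numpy as np

def softmax(v):
    """p_k = exp(v_k) / sum_j exp(v_j), shifted by max(v) for stability."""
    z = np.exp(v - np.max(v))
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.1])  # example K = 3 score vector from the model
p = softmax(scores)
print(p, p.sum())                   # probabilities sum to 1
```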
1.4 Lightweight gradient boosting algorithm
The Light Gradient Boosting Machine (LightGBM) is the core method on which the invention is based: a fast, high-performance algorithm built on the GBDT framework and released as an open-source project by Microsoft Research Asia in 2017. In GBDT, features are selected and split using a pre-sorting approach, which determines the split points precisely but consumes a large amount of memory and time. LightGBM instead employs a histogram-based approach and a leaf-wise growth strategy with depth limitation to increase training speed and reduce memory consumption.
The basic idea of the histogram algorithm adopted by LightGBM (FIG. 11) is to discretize continuous floating-point feature values into S bins ("buckets") and build a histogram on that basis. While traversing the data, these discrete values are used as indices to accumulate statistics (the number of samples in each bucket) into the corresponding histogram; all discrete values are then traversed to find the optimal split point. The advantages of the histogram algorithm mainly include: (1) improved operational efficiency and reduced computation time, lowering the time complexity from O(N) to O(S), i.e., tasks that originally required N computations can be completed with only S, where S is smaller than N; (2) reduced memory occupation, since bucketing (binning) discretizes the continuous values and, when the bin count is small, the training data can be stored with smaller data types.
The growth of decision trees in LightGBM differs from the level-wise (layer-by-layer) growth strategy (FIG. 12) adopted by other decision tree algorithms, which indiscriminately generates a whole layer of new leaf nodes when splitting; that keeps the complexity of the resulting tree model low but wastes memory on many invalid or inefficient nodes. LightGBM instead adopts a leaf-wise (leaf-by-leaf) growth strategy (FIG. 13), which selects the node with the largest gain according to entropy or the Gini coefficient, greatly improving analysis and computation efficiency. In addition, because this strategy may cause overfitting when the decision tree grows too deep, a maximum growth depth limit is added on top of the leaf-wise strategy.
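In the open-source LightGBM implementation, the histogram bin count, the leaf-wise leaf budget, and the depth limit discussed above are exposed as the max_bin, num_leaves, and max_depth parameters. The values below are illustrative assumptions, not the tuned values of this invention, and the commented training call shows where the prepared facies data set would be plugged in.

```python
import lightgbm as lgb

params = {
    "objective": "multiclass",
    "num_class": 9,          # 9 seismic facies classes
    "metric": "multi_logloss",
    "max_bin": 255,          # S bins used by the histogram algorithm
    "num_leaves": 63,        # leaf-wise growth: budget of leaves per tree
    "max_depth": 8,          # depth limit added to curb overfitting
    "learning_rate": 0.05,
}

# A training run would build a Dataset from the prepared facies data, e.g.:
# train_set = lgb.Dataset(X_train, label=y_train)
# booster = lgb.train(params, train_set, num_boost_round=300)
```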
1.5 model feature selection
Referring to FIG. 14, which is a schematic diagram of the feature selection technique in the invention: feature selection, also called attribute selection, is the process of choosing, according to some criterion, n (n < m) features suitable for model construction out of the m features in a data set. In application scenarios involving multidimensional data such as natural language recognition, medical gene diagnosis and remote sensing image processing, not every feature contributes to model construction; a large number of redundant features easily leads to the curse of dimensionality, multiplies the complexity of the model, and seriously affects the efficiency and accuracy of model construction.
According to how they interact with the classifier, feature selection methods are mainly divided into 3 types: filter, wrapper and embedded.
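The sketch below illustrates the embedded approach described above: gain-based feature importances are read from a fitted LightGBM model and low-contribution features are dropped with scikit-learn's SelectFromModel. The median threshold is an illustrative assumption, and X_train and y_train are assumed to come from the earlier split.

```python
import lightgbm as lgb
from sklearn.feature_selection import SelectFromModel

# X_train, y_train: the training split from the earlier sketch (assumption).
model = lgb.LGBMClassifier(importance_type="gain")
model.fit(X_train, y_train)

# Gain-based importance: total split gain contributed by each feature over all trees.
for name, score in zip(model.feature_name_, model.feature_importances_):
    print(name, round(float(score), 2))

# Embedded selection: keep only features whose importance exceeds the median.
selector = SelectFromModel(model, threshold="median", prefit=True)
X_train_reduced = selector.transform(X_train)
print(X_train.shape, "->", X_train_reduced.shape)
```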
The foregoing has shown and described the basic principles, principal features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and that the above embodiments and descriptions are merely illustrative of the principles of the present invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined in the appended claims.
Claims (5)
1. A rapid seismic facies identification method based on the LightGBM algorithm, characterized by comprising the following steps:
S1, creating seismic facies classification labels for a seismic imaging data training set, and then preprocessing and expanding the data set to complete the preparation of the data set and labels;
S2, splitting the seismic imaging data and label data set into a training set and a verification set at a ratio of 6:4, wherein the training set is used to train the LightGBM model and the verification set is used to evaluate the performance of the model;
S3, tuning the hyperparameters of the LightGBM model by cross-validation, fitting a prediction model on the training set with the optimal hyperparameter combination, and comprehensively evaluating the performance of the model with the verification set data to obtain the optimal classification model for the current task;
S4, inputting the data set for verification into the prediction model to obtain the automatic seismic facies classification result, completing the automatic classification of seismic facies.
2. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 1, characterized in that in step S1, expanding the data set comprises the following steps:
S11, classifying the data to be identified by manual interpretation in combination with existing seismic interpretation methods;
S12, selecting the maximum probability value and comparing it with a preset probability threshold λ; if it is not smaller than λ, outputting the seismic facies class corresponding to the maximum probability value as the recognition result, otherwise performing data expansion;
S13, labeling the data to be identified with the classification result and then adding it to the sample data set;
S14, maintaining the original model with the new sample data so as to realize the identification of new seismic facies types.
3. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 2, characterized in that in step S1, when the seismic facies classification labels are created, the seismic facies are classified into 9 classes, and noise data with low correlation are removed by feature selection so as to improve the training efficiency and generalization capability of the model.
4. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 3, characterized in that the LightGBM algorithm is a gradient boosting decision tree model, the feature selection adopts an embedded feature selection approach, feature importance is calculated from the contribution rate of features in the tree model structure, and features with relatively high importance are preferentially selected.
5. The rapid seismic facies identification method based on the LightGBM algorithm according to claim 4, characterized in that in the LightGBM feature importance calculation, the global contribution rate of feature j is measured by the average of its contribution rates in the individual trees:
J_j^2 = (1/M) · Σ_{m=1}^{M} J_j^2(T_m)
where T_m denotes the m-th decision tree and M is the number of decision trees; the contribution rate of feature j in a single tree is given by:
J_j^2(T) = Σ_{t=1}^{L-1} i_t^2 · 1(v_t = j)
where L is the number of leaf nodes of the tree, L-1 is the number of non-leaf nodes, v_t is the feature used to split node t, and i_t^2 is the reduction in squared loss after node t splits.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311804932.7A CN117763356A (en) | 2023-12-26 | 2023-12-26 | Rapid earthquake phase identification method based on LightGBM algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311804932.7A CN117763356A (en) | 2023-12-26 | 2023-12-26 | Rapid earthquake phase identification method based on LightGBM algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117763356A true CN117763356A (en) | 2024-03-26 |
Family
ID=90323534
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311804932.7A Pending CN117763356A (en) | 2023-12-26 | 2023-12-26 | Rapid earthquake phase identification method based on LightGBM algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117763356A (en) |
Patent Citations (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11169288B1 (en) * | 2017-12-07 | 2021-11-09 | Triad National Security, Llc | Failure prediction and estimation of failure parameters |
CN110889308A (en) * | 2018-09-07 | 2020-03-17 | 中国石油化工股份有限公司 | Earthquake seismographic first arrival identification method and system based on machine learning |
CN110413494A (en) * | 2019-06-19 | 2019-11-05 | 浙江工业大学 | A LightGBM Fault Diagnosis Method Based on Improved Bayesian Optimization |
DE202020101012U1 (en) * | 2020-02-25 | 2020-03-08 | Robert Bosch Gmbh | Device for predicting a suitable configuration of a machine learning system for a training data set |
CN111310860A (en) * | 2020-03-26 | 2020-06-19 | 清华大学深圳国际研究生院 | Method and computer-readable storage medium for improving performance of gradient boosting decision trees |
AU2020100630A4 (en) * | 2020-04-24 | 2020-06-04 | Kaplan, Umit Emrah MR | System and method for grade estimation using gradient boosted decesion tree based machine learning algorithims |
US20210350274A1 (en) * | 2020-05-07 | 2021-11-11 | International Business Machines Corporation | Dataset management in machine learning |
CN111999765A (en) * | 2020-08-14 | 2020-11-27 | 广西大学 | Microseismic multi-precursor method and device for early warning of instability of falling karst dangerous rock |
WO2022053147A1 (en) * | 2020-09-11 | 2022-03-17 | Swiss Reinsurance Company Ltd. | Mobile device and system for identifying and/or classifying occupants of a vehicle and corresponding method thereof |
WO2022088979A1 (en) * | 2020-10-26 | 2022-05-05 | 四川大学华西医院 | Method for accelerating system evaluation updating by integrating a plurality of bert models by lightgbm |
CN112529112A (en) * | 2020-12-29 | 2021-03-19 | 中国地质科学院地质力学研究所 | Mineral identification method and device |
WO2022205768A1 (en) * | 2021-04-02 | 2022-10-06 | 四川大学华西医院 | Random contrast test identification method for integrating multiple bert models on the basis of lightgbm |
CN114153976A (en) * | 2021-12-10 | 2022-03-08 | 华南理工大学 | Traffic incident classification method, system and medium based on social media data |
US20230222397A1 (en) * | 2022-01-07 | 2023-07-13 | Saudi Arabian Oil Company | Method for automated ensemble machine learning using hyperparameter optimization |
WO2023137434A1 (en) * | 2022-01-13 | 2023-07-20 | Schlumberger Technology Corporation | Reflection seismology inversion with quality control |
CN114067092A (en) * | 2022-01-17 | 2022-02-18 | 山东药品食品职业学院 | Fatty liver B-mode ultrasound image classification method based on DenseNet and lightGBM |
US11527786B1 (en) * | 2022-03-28 | 2022-12-13 | Eatron Technologies Ltd. | Systems and methods for predicting remaining useful life in batteries and assets |
WO2023197612A1 (en) * | 2022-04-15 | 2023-10-19 | 湖南大学 | Automatic data augmentation-based medical image segmentation method |
CN114757285A (en) * | 2022-04-18 | 2022-07-15 | 广西师范大学 | A Trusted Federated Gradient Boosting Decision Tree Training Method Based on Trusted Incentives |
CN114676932A (en) * | 2022-04-18 | 2022-06-28 | 工银瑞信基金管理有限公司 | Bond default prediction method and device based on class imbalance machine learning framework |
CN115099266A (en) * | 2022-05-31 | 2022-09-23 | 上海工程技术大学 | Hard vehicle surface white layer prediction method based on gradient lifting decision tree |
CN115050477A (en) * | 2022-06-21 | 2022-09-13 | 河南科技大学 | Bayesian optimization based RF and LightGBM disease prediction method |
CN115308799A (en) * | 2022-09-05 | 2022-11-08 | 中国地质科学院地质力学研究所 | Seismic imaging free gas structure identification method and system |
CN115631739A (en) * | 2022-10-12 | 2023-01-20 | 广州蓝深科技有限公司 | Music chord identification method based on LightGBM algorithm |
CN115759435A (en) * | 2022-11-24 | 2023-03-07 | 辽宁东科电力有限公司 | Photovoltaic power generation power prediction method based on improved CNN-LSTM |
CN116341728A (en) * | 2023-03-16 | 2023-06-27 | 电子科技大学 | Ultra-short-term photovoltaic output power prediction method based on data driving |
CN116756679A (en) * | 2023-05-19 | 2023-09-15 | 中法渤海地质服务有限公司 | Multi-source information fusion-based method for judging geological mode of down-the-hole mountain |
CN117035151A (en) * | 2023-06-25 | 2023-11-10 | 西安石油大学 | Unstable water injection working system optimization method and system based on lightGBM algorithm |
CN116609852A (en) * | 2023-07-06 | 2023-08-18 | 中国石油大学(华东) | A high-precision modeling method and equipment for underground medium parameters based on well-seismic fusion |
Non-Patent Citations (2)
Title |
---|
- Li Yuan: "Wavelet Transform and Its Engineering Applications", Beijing University of Posts and Telecommunications Press, 30 April 2010, pages 68-69 *
- Yan Xingyu; Gu Hanming; Xiao Yifei; Ren Hao; Ni Jun: "Application of the XGBoost algorithm in well-log interpretation of tight sandstone gas reservoirs", Oil Geophysical Prospecting, no. 02, 15 April 2019 (2019-04-15) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |