CN110726694A

CN110726694A - Characteristic wavelength selection method and system of spectral variable gradient integrated genetic algorithm

Info

Publication number: CN110726694A
Application number: CN201911006149.XA
Authority: CN
Inventors: 张小鸣; 刘鑫; 李绍稳; 金秀
Original assignee: Changzhou University
Current assignee: Changzhou University
Priority date: 2019-10-22
Filing date: 2019-10-22
Publication date: 2020-01-24

Abstract

The invention discloses a characteristic wavelength selection method and system based on a spectral variable gradient integrated genetic algorithm. The full spectrum is divided into a plurality of wavelength intervals; important wavelength intervals are extracted from all the wavelength intervals according to the variable importance in projection (VIP) coefficients and merged into an interval spectrum; randomly combined characteristic wavelength vectors of the interval spectrum are taken as the initial population of a genetic algorithm, the reciprocal of the root mean square error of a partial least squares regression model is taken as the fitness function of a characteristic wavelength vector, and the characteristic wavelength vector with the maximum fitness value is selected as the optimal characteristic wavelength vector; selection, crossover and mutation are applied to the population, and the resulting new individuals replace the original population to form a new population; the process is iterated for the set number of generations, and the final optimal characteristic wavelength vector is output. The invention addresses the selection of collinear and redundant wavelength variables in the prior art, simplifies the calculation, improves the prediction precision, and gives the regression model better interpretability and generalization capability.

Description

Characteristic wavelength selection method and system of spectral variable gradient integrated genetic algorithm

Technical Field

The invention relates to the technical field of spectral analysis, in particular to a characteristic wavelength selection method and a characteristic wavelength selection system of a spectral variable gradient integrated genetic algorithm.

Background

Spectroscopic analysis is a method of identifying substances and determining their chemical composition and relative content from their spectra, and is an analytical method founded on molecular and atomic spectroscopy. Since each atom has its own characteristic spectrum, a substance can be identified and its chemical composition determined from the spectrum. A substance can be analyzed qualitatively from the characteristic spectra of different spectral analysis methods, and quantitatively from the spectral intensity. The characteristic wavelengths selected when establishing a spectral detection model have a great influence on the accuracy of the model. Existing characteristic wavelength variable selection methods based on swarm intelligence optimization algorithms have two problems: the probability of selecting weakly correlated wavelength variables is high, and the search easily falls into a local optimal solution.

When a soil sample is irradiated with visible near-infrared light, the hydrogen-containing chemical groups in soil nutrient substances (such as C-H, O-H, S-H and N-H) are excited and produce overtone and combination-band absorption of molecular vibration, so the content of a soil nutrient can be accurately measured from the degree to which it absorbs the visible near-infrared spectrum. However, each soil nutrient has its own absorption wavelengths, the absorption signals are weak, the bands overlap, and interference such as environmental noise and irrelevant information is also present, so the near-infrared absorption spectrum of a soil sample is extremely complex. In addition, the spectral data of the same sample are collinear, which readily produces data redundancy. If a regression model is established with full-spectrum data, the heavy overlap of the visible near-infrared spectrum and the collinearity between adjacent characteristic variables are difficult to eliminate, the prediction precision is low, and the model is complex and generalizes poorly. Characteristic wavelength selection methods based on swarm intelligence optimization algorithms take the root mean square error of the PLSR as the objective function and randomly search for the characteristic wavelength vector with the minimum root mean square error. However, because the characteristic wavelength variables are selected over the whole visible near-infrared spectrum, the probability of selecting weakly correlated wavelength variables is high, and the search easily falls into a local optimal solution.
Therefore, selecting from the visible near-infrared full-spectrum wavelength variables the interval spectrum most strongly correlated with the soil nutrient target variable, and then selecting characteristic wavelength variables on that interval spectrum, is the key technology for improving the prediction precision of soil nutrients.

Disclosure of Invention

The characteristic wavelength selection methods used in existing near-infrared spectral analysis easily fall into a local optimal solution, so the probability of selecting weakly correlated wavelengths is high and the analysis precision needs to be improved. The invention aims to solve these problems and provides a characteristic wavelength selection method and a characteristic wavelength selection system of a spectral variable gradient integrated genetic algorithm.

In one aspect, the invention provides a characteristic wavelength selection method of a spectral variable gradient integrated genetic algorithm, which comprises the following steps:

scanning a plurality of soil samples by using visible near infrared spectrum scanning equipment to generate a visible near infrared spectrum data matrix, establishing a partial least squares regression model for full spectrum wavelength variables contained in the spectrum data matrix, and determining importance projection coefficients of the full spectrum wavelength variables;

dividing the full spectrum of the spectrum data matrix into a plurality of wavelength intervals, and extracting the wavelength intervals with the important projection coefficients of the wavelength variables larger than a preset value from all the wavelength intervals to obtain important wavelength intervals;

merging the important wavelength intervals of the spectrum data matrix into an interval spectrum, taking the random combination characteristic wavelength vector of the interval spectrum as an initial population of the genetic algorithm, and solving the root mean square error of the partial least square regression model;

taking the reciprocal of the root mean square error of the partial least squares regression model as the fitness function of the characteristic wavelength vector, and selecting the characteristic wavelength vector with the maximum fitness value as the optimal characteristic wavelength vector; applying selection, crossover and mutation to the initial population, and replacing the original population with the resulting new individuals to form a new population; iterating for the set number of generations, and outputting the final optimal characteristic wavelength vector.

Further, after the important wavelength intervals are obtained, a backward interval partial least squares regression algorithm is applied to each important wavelength interval: wavelength variables are removed one at a time until only one wavelength variable is left, and the combined wavelength vector corresponding to the minimum root mean square error of the partial least squares regression model in that interval is found. Each new important wavelength interval is constructed in this way, the new intervals are merged into an interval spectrum, randomly combined characteristic wavelength vectors of the interval spectrum are taken as the initial population of the genetic algorithm, and the root mean square error of the partial least squares regression model is solved.

Further, the method for dividing the full spectrum into a plurality of wavelength intervals is as follows:

calculating a purity row vector of the full-spectrum wavelength variable and a linear purity gradient vector of the full-spectrum wavelength variable in the horizontal direction; the full spectrum is divided into a plurality of wavelength intervals by using the positive and negative changes of the gradient value in the linear purity gradient vector of the full spectrum wavelength variable.

Further, the fitness function F expression of the characteristic wavelength vector is as follows:

$F = 1/\mathrm{RMSE}$,

$\mathrm{RMSE} = \sqrt{\frac{1}{n_p}\sum_{i=1}^{n_p}\left(y_i - \hat{y}_i\right)^2}$,

wherein RMSE is the root mean square error of the partial least squares regression model established on the column data of the full-spectrum data matrix, $y_i$ is the reference method test value of the i-th sample, $\hat{y}_i$ is the value predicted by the partial least squares regression model from the characteristic wavelength variables of the i-th sample, and $n_p$ is the number of samples.
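The fitness function defined above can be illustrated with a minimal sketch. It assumes the PLSR predictions are already available as a list; the function names, the sample values and the `eps` guard against a zero RMSE are illustrative assumptions, not part of the disclosure.

```python
import math

def rmse(y_ref, y_pred):
    """Root mean square error between reference values and model predictions."""
    if len(y_ref) != len(y_pred) or not y_ref:
        raise ValueError("y_ref and y_pred must be non-empty and of equal length")
    n_p = len(y_ref)  # number of samples
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_ref, y_pred)) / n_p)

def fitness(y_ref, y_pred, eps=1e-12):
    """Fitness F = 1 / RMSE; eps (an assumption) avoids division by zero."""
    return 1.0 / max(rmse(y_ref, y_pred), eps)

y_ref = [10.0, 12.0, 15.0, 11.0]   # hypothetical reference-method test values
y_pred = [11.0, 12.0, 13.0, 11.0]  # hypothetical PLSR predictions
print(rmse(y_ref, y_pred), fitness(y_ref, y_pred))
```

A smaller RMSE gives a larger fitness, so maximizing F in the genetic algorithm is equivalent to minimizing the prediction error.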

Further, a new population is formed according to the selected population size, crossover probability, mutation probability and selection probability, wherein the mutation operator is a real-number-coded differential mutation operator calculated as follows:

$Z(i,j) = D \times \left(E(r_1,j) - E(r_2,j)\right) + E(i,j)$,

wherein Z(i, j) is the real-number-coded offspring value of the j-th chromosome of the i-th individual, D is the mutation factor, E(r1, j) is the real-number-coded parent value of the j-th chromosome of the r1-th individual randomly chosen from the population, E(r2, j) is the real-number-coded parent value of the j-th chromosome of the r2-th individual randomly chosen from the population, E(r1, j) - E(r2, j) is the difference between these two parent values, and E(i, j) is the real-number-coded parent value of the j-th chromosome of the i-th individual.
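The differential mutation operator can be sketched as follows. The sketch assumes the population E is a list of real-valued rows, that r1 and r2 are drawn as two distinct individuals, and that each gene mutates independently with probability `p_mut`; these choices and all function names are illustrative assumptions.

```python
import random

def mutate_gene(E, i, j, r1, r2, D):
    """Z(i,j) = D * (E(r1,j) - E(r2,j)) + E(i,j) for a single gene."""
    return D * (E[r1][j] - E[r2][j]) + E[i][j]

def differential_mutation(E, D, p_mut, rng):
    """Apply the operator gene by gene, each with probability p_mut."""
    n_ind = len(E)
    Z = [row[:] for row in E]  # offspring start as copies of the parents
    for i in range(n_ind):
        for j in range(len(E[i])):
            if rng.random() < p_mut:
                r1, r2 = rng.sample(range(n_ind), 2)  # assumption: r1 != r2
                Z[i][j] = mutate_gene(E, i, j, r1, r2, D)
    return Z

E = [[3.0, 10.0, 25.0], [5.0, 12.0, 20.0], [8.0, 15.0, 30.0]]  # toy population
print(mutate_gene(E, 0, 1, 2, 1, 0.5))  # 0.5 * (15 - 12) + 10 = 11.5
Z = differential_mutation(E, D=0.5, p_mut=0.3, rng=random.Random(0))
```

The offspring values are real numbers, so before they are used as wavelength index numbers they would need to be mapped back into the valid range, for example by rounding and clipping; how that is done is not specified here.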

Further, the method of extracting important wavelength intervals having an importance projection coefficient of a wavelength variable larger than a predetermined value from all wavelength intervals and combining them into one interval spectrum is as follows:

the wavelength column numbers in each important wavelength interval are converted into the wavelength index number row vector of the interval spectrum; the range of column numbers of this row vector is the value range of the elements of a characteristic wavelength vector, and each column of data of the spectral data matrix is obtained through a mapping table between the column numbers and the interval spectrum wavelength index number row vector.
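A minimal sketch of this mapping follows. It assumes intervals are given as inclusive (start, end) pairs of 0-based column numbers and that a chromosome stores positions within the index row vector; the function names and toy data are hypothetical.

```python
def build_index_vector(intervals):
    """Concatenate inclusive (start, end) column ranges into one index row vector."""
    index_vector = []
    for start, end in intervals:
        index_vector.extend(range(start, end + 1))
    return index_vector

def columns_for(chromosome, index_vector, X):
    """Map gene values (positions in index_vector) to columns of the matrix X."""
    cols = [index_vector[pos] for pos in chromosome]
    return [[row[c] for c in cols] for row in X]

intervals = [(2, 4), (7, 8)]                   # hypothetical important intervals
index_vector = build_index_vector(intervals)   # columns 2, 3, 4, 7, 8
X = [[10 * r + c for c in range(10)] for r in range(3)]  # toy 3 x 10 matrix
chromosome = [0, 3, 4]                         # positions within index_vector
X_sel = columns_for(chromosome, index_vector, X)
print(index_vector, X_sel)
```

Because genes range only over positions in the index row vector, every characteristic wavelength vector the genetic algorithm generates refers to a column inside an important interval.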

In another aspect, the present invention provides a system for selecting a characteristic wavelength of a spectral variable gradient integrated genetic algorithm, comprising:

the partial least square regression model establishing module is used for scanning a plurality of samples by utilizing visible near infrared spectrum scanning equipment to generate a visible near infrared spectrum data matrix, establishing a partial least square regression model for a full spectrum wavelength variable contained in the spectrum data matrix and determining an importance projection coefficient of the full spectrum wavelength variable;

the wavelength interval division module is used for dividing the full spectrum of the spectrum data matrix into a plurality of wavelength intervals;

the important wavelength interval determining module is used for extracting, from all the wavelength intervals, the wavelength intervals whose wavelength variables have importance projection coefficients greater than the preset value, to obtain the important wavelength intervals;

the genetic algorithm selection module is used for combining the important wavelength intervals of the spectrum data matrix into an interval spectrum, taking the random combination characteristic wavelength vector of the interval spectrum as an initial population of the genetic algorithm, and solving the root mean square error of the partial least square regression model;

Further, the important wavelength interval determining module is also used for, after the important wavelength intervals are obtained, applying a backward interval partial least squares regression algorithm to each important wavelength interval, removing wavelength variables one at a time until only one wavelength variable is left, finding the combined wavelength vector corresponding to the minimum root mean square error of the partial least squares regression model in each important wavelength interval, and constructing each new important wavelength interval.

The beneficial technical effects of the invention are as follows: the full spectrum of the spectral data matrix is divided into a plurality of wavelength intervals, the wavelength intervals whose wavelength variables have importance projection coefficients larger than a preset value are extracted and merged into an interval spectrum, and randomly combined characteristic wavelength vectors of the interval spectrum are taken as the initial population of the genetic algorithm. This greatly increases the probability that the genetic algorithm selects potentially optimal characteristic wavelength variables within the interval spectrum, and avoids the selection of collinear and redundant wavelength variables that occurs when a swarm intelligence optimization algorithm searches the whole visible near-infrared spectrum. The computation of the regression model is reduced, the prediction accuracy is improved, and the regression model has better interpretability and generalization capability;

the visible near-infrared full spectrum is divided into a plurality of wavelength intervals according to the sign changes of the linear purity gradient of its wavelength variables, and the criterion "variable importance in projection (VIP) coefficient output by the partial least squares regression (PLSR) model larger than a preset value" is used to extract from the full spectrum the important wavelength intervals that strongly explain the predicted target quantity. The important wavelength intervals are merged into an interval spectrum, and randomly combined characteristic wavelength vectors of the interval spectrum are taken as the initial population of the genetic algorithm. This raises the probability that the genetic algorithm selects strongly correlated wavelength variables from the interval spectrum, lowers the probability of selecting weakly correlated wavelength variables from the full spectrum, helps eliminate collinearity and redundant data, and improves the prediction precision of the regression model;

after the important wavelength intervals are obtained, a backward interval partial least squares regression algorithm is applied to each of them separately to find the combined wavelength vector corresponding to the minimum root mean square error of the partial least squares regression model in that interval; each new important wavelength interval is constructed in this way and the new intervals are merged into an interval spectrum, whose randomly combined characteristic wavelength vectors are taken as the initial population of the genetic algorithm. This further raises the probability that the genetic algorithm selects highly correlated wavelength variables in the interval spectrum, effectively eliminates collinearity and redundant data, and improves the prediction precision of the resulting regression model;

the invention provides a real-number-coded differential mutation operator, which enlarges the space in which the improved genetic algorithm searches for the global optimal solution, so that the improved genetic algorithm can find the global optimal solution and converges quickly;

the invention further sets the column number range of the wavelength index number row vector of the interval spectrum as the value range of the characteristic wavelength vector elements, acquires each column data of the spectrum data matrix through the mapping table of the column number and the specific interval spectrum wavelength index number row vector, and establishes a partial least square regression model, so that the characteristic wavelength vector population generation mode and the spectrum matrix data acquisition method are simple and easy to implement.

Drawings

FIG. 1 is a flow chart of a method for selecting a characteristic variable of a spectral variable gradient integrated genetic algorithm according to an embodiment of the present invention;

FIG. 2 is a flow chart of a feature variable selection method of a spectral variable gradient integrated genetic algorithm according to another embodiment of the present invention;

FIG. 3 is a graph of the visible near infrared full spectrum variable purity of a soil fast-acting phosphorus calibration set according to an embodiment of the present invention;

FIG. 4 is a graph of a visible near infrared full spectrum variable purity gradient of a soil fast-acting phosphorus calibration set according to an embodiment of the present invention;

FIG. 5 illustrates a VIP curve of a near infrared full spectrum PLSR for a soil fast-acting phosphorus calibration set according to an embodiment of the present invention;

FIG. 6 is a modified genetic algorithm fitness function F iterative optimization curve according to an embodiment of the present invention;

FIG. 7 shows 25 optimal characteristic wavelength profiles selected by the improved genetic algorithm according to the embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

The characteristic variable selection method of the spectral variable gradient integrated genetic algorithm provided by the invention comprises the following steps (as shown in figure 1):

scanning a plurality of samples by using visible near infrared spectrum scanning equipment to generate a visible near infrared spectrum data matrix, establishing a partial least square regression model for full spectrum wavelength variables contained in the spectrum data matrix, and determining an importance projection coefficient of the full spectrum wavelength variables;

merging the important wavelength intervals of the spectral data matrix into an interval spectrum, taking randomly combined characteristic wavelength vectors of the interval spectrum as the initial population of the genetic algorithm, and solving the root mean square error of the partial least squares regression model;

Establishing a regression model by using the optimal characteristic wavelength vector to perform quantitative analysis, such as determining element content predicted values related to the characteristic wavelength; the characteristic variable selection method and the system based on the spectral variable gradient integrated genetic algorithm can be applied to element quantitative analysis in the fields of fruit sugar content, meat protein content, mineral element content, soil nutrient content and the like.

The method can effectively reduce the probability of selecting the weakly-correlated wavelength variable in the visible near-infrared full spectrum, is beneficial to eliminating the co-linear relation and redundant data, and improves the prediction precision of the material content.

The variable importance in projection (VIP) coefficient is one of the important output parameters of the PLSR model and reflects the score the PLSR model assigns to each independent variable. When a PLSR model is established on the visible near-infrared full-spectrum wavelength variables of the correction set, a wavelength variable with VIP greater than 1 is generally considered to play an important role in predicting the target variable.
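For reference, the VIP score is commonly defined as below. This is the standard textbook definition rather than a formula given in this document, and all symbols are assumptions introduced here: $n$ is the number of wavelength variables, $A$ the number of PLSR latent components, $w_a$ the weight vector of component $a$ with $j$-th element $w_{ja}$, $t_a$ its score vector, and $q_a$ its y-loading.

```latex
\mathrm{VIP}_j = \sqrt{\frac{n \sum_{a=1}^{A} SS_a \left( w_{ja} / \lVert w_a \rVert \right)^2}{\sum_{a=1}^{A} SS_a}},
\qquad SS_a = q_a^{2}\, t_a^{\top} t_a
```

With this definition the mean of $\mathrm{VIP}_j^2$ over all variables equals 1, which is one way to understand why 1 is the customary threshold.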

The method can further reduce the important wavelength interval after obtaining the important wavelength interval to obtain a new important wavelength interval, further improve the probability of selecting the wavelength variable with high correlation, more effectively eliminate the co-linear relation and redundant data, and ensure that the prediction precision of the established regression model is better.

According to the above method, the first embodiment of the invention provides a characteristic variable selection method of the spectral variable gradient integrated genetic algorithm. This embodiment is directed at soil element content analysis, can be applied to the software design of a soil near-infrared spectrum analyzer, and comprises the following steps:

Step 1: m soil samples are scanned at n visible near-infrared wavelengths to generate an m×n visible near-infrared spectral data matrix sample set, which is preprocessed to eliminate the influence of noise and then divided into a correction set and a verification set. The correction set is used as the training sample, and the verification set is used to verify the final prediction effect. The content of the element to be detected in each sample of the sample set can be a test value obtained by a reference method from the prior art;

Step 2: the full spectrum is divided into s wavelength intervals. In a specific embodiment, the division may be implemented with the prior art; for example, the most classical method divides the full spectrum equally into N wavelength intervals, which is not described in detail here. The preferred wavelength interval division method provided in the following embodiments may also be employed in other embodiments;

Step 3: a partial least squares regression (PLSR) model is established with the full-spectrum wavelength variables of the correction set, and the variable importance in projection (VIP) coefficients of the full-spectrum wavelength variables are output.

Step 4: with "VIP of a wavelength variable greater than the predetermined value" as the extraction criterion for important wavelength intervals (the predetermined value is 1 in this embodiment; other embodiments may set other values as required, such as 0.5 or 0.8), k important wavelength intervals (k < s) that are not fully contiguous and that include only wavelength variables whose VIP is greater than the predetermined value are extracted from the s wavelength intervals.
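The extraction criterion of step 4 can be sketched as follows. The sketch reads the criterion as "within each interval keep only the columns whose VIP exceeds the threshold, and drop intervals left empty"; that reading, the function name and the VIP values are assumptions for illustration.

```python
def extract_important_intervals(intervals, vip, threshold=1.0):
    """Return, per interval, the columns with VIP > threshold; skip empty ones."""
    important = []
    for start, end in intervals:  # inclusive 0-based column ranges
        kept = [c for c in range(start, end + 1) if vip[c] > threshold]
        if kept:
            important.append(kept)
    return important

vip = [0.4, 1.2, 1.5, 0.9, 0.3, 0.2, 1.1, 0.8]  # hypothetical VIP per wavelength
intervals = [(0, 2), (3, 5), (6, 7)]            # s = 3 wavelength intervals
important = extract_important_intervals(intervals, vip)
print(important)  # [[1, 2], [6]], so k = 2 < s = 3
```

Lowering the threshold to 0.5 or 0.8, as other embodiments may do, keeps more wavelengths per interval and usually more intervals.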

Step 5: the important wavelength intervals of the spectral data matrix are merged in sequence into an interval spectrum (other embodiments may obtain the interval spectrum by other merging modes), randomly combined characteristic wavelength vectors of the interval spectrum are taken as the initial population of the genetic algorithm, and the root mean square error of the partial least squares regression model is solved;

Step 6: the population size of the genetic algorithm and the number of characteristic wavelengths are set, the reciprocal of the root mean square error of the partial least squares regression model is taken as the fitness function of a characteristic wavelength vector, and the characteristic wavelength vector with the maximum fitness value is selected as the optimal characteristic wavelength vector; selection, crossover and mutation are applied to the initial population, and the resulting new individuals replace the original population to form a new population; the process is iterated for the set number of generations, and the final optimal characteristic wavelength vector is output; a regression model is established with the optimal characteristic wavelength vector, and the predicted value of the content of a given substance in a given sample is solved.
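The loop of step 6 can be sketched as below. This is a simplified illustration under stated assumptions: the fitness is a toy stand-in for 1/RMSE of a PLSR model, selection is fitness-proportional, crossover is single-point, and mutation resets a gene to a random position rather than using the differential operator. The names `run_ga`, `toy_fitness` and `TARGET` are hypothetical.

```python
import random

def run_ga(n_candidates, n_features, fitness_fn, pop_size=20, generations=50,
           p_cross=0.8, p_mut=0.1, seed=0):
    """Return the fittest characteristic wavelength vector seen during the run."""
    rng = random.Random(seed)
    # Initial population: random combinations of positions in the interval spectrum.
    pop = [rng.sample(range(n_candidates), n_features) for _ in range(pop_size)]
    best = max(pop, key=fitness_fn)
    for _ in range(generations):
        scores = [fitness_fn(ind) for ind in pop]
        new_pop = []
        while len(new_pop) < pop_size:
            # Selection: fitness-proportional choice of two parents (assumed scheme).
            a, b = rng.choices(pop, weights=scores, k=2)
            child = a[:]
            if rng.random() < p_cross:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = a[:cut] + b[cut:]
            # Mutation: reset a gene to a random candidate position (simplification).
            child = [rng.randrange(n_candidates) if rng.random() < p_mut else g
                     for g in child]
            new_pop.append(child)
        pop = new_pop  # new individuals replace the original population
        gen_best = max(pop, key=fitness_fn)
        if fitness_fn(gen_best) > fitness_fn(best):
            best = gen_best
    return best

TARGET = {1, 4, 7}  # hypothetical informative positions

def toy_fitness(ind):
    """Stand-in for 1/RMSE: larger when more informative positions are selected."""
    return 1.0 / (1 + len(TARGET - set(ind)))

best = run_ga(n_candidates=10, n_features=3, fitness_fn=toy_fitness)
print(sorted(best), toy_fitness(best))
```

In a real setting `fitness_fn` would fit a PLSR model on the columns selected by the individual and return the reciprocal of its RMSE, which makes each fitness evaluation far more expensive than in this toy.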

The optimal characteristic wavelength vector obtained by adopting the spectral variable gradient integrated genetic algorithm characteristic variable selection method provided by the invention is used for establishing the regression model, so that the structure of the regression model can be simplified, the precision of near infrared spectrum analysis is improved, the regression model has better generalization capability and robustness, the method can be applied to software design of a near infrared spectrometer, has the characteristics of simple realization, simultaneous determination of multiple components, high analysis speed, low cost, no damage to samples, no consumption of chemical reagents, no environmental pollution and the like, and has good popularization and application prospects in the aspect of detection of contents of substances such as soil nutrients, fruit sugar contents, meat protein contents, mineral element contents and the like.

The regression model is established by using multiple groups of observed data $(x_i, y_i)$ of a sample set to estimate the regression coefficients in the regression equation. The method for establishing the regression model is not limited to the partial least squares regression model and can be realized by the prior art, preferably by a nonlinear regression model.

For the partial least squares regression model, the predicted value $\hat{y}$ of the substance content of a sample has a multiple linear relationship with the wavelength variables:

$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n + \varepsilon$,

where $\varepsilon$ is the random error, $\beta_0$ is the regression constant, $\beta_1, \dots, \beta_n$ are the n regression coefficients, and $x_1, \dots, x_n$ are the characteristic wavelength variables, that is, the near-infrared diffuse-reflectance absorbance data obtained by scanning at the n characteristic wavelengths. Estimating $\beta_1, \dots, \beta_n$ means establishing a partial least squares regression model and finding the regression coefficients $\beta_1, \dots, \beta_n$ that minimize the root mean square error between the predicted values $\hat{y}_i$ and the sample reference method test values $y_i$.

The second embodiment: on the basis of the above embodiment, in order to further raise the probability of selecting highly correlated wavelength variables and to remove collinearity and redundant data more effectively, so that the resulting regression model predicts more accurately, the method further includes the following. After the important wavelength intervals are obtained, a backward interval partial least squares regression algorithm is applied to each selected wavelength interval: wavelength variables are removed one at a time until only one wavelength variable is left, and the combined wavelength vector corresponding to the minimum root mean square error of the partial least squares regression model in that interval is found. Each new important wavelength interval is constructed in this way, and the new important wavelength intervals are merged in sequence into an interval spectrum (other embodiments may obtain the interval spectrum by other merging modes). Randomly combined characteristic wavelength vectors of the interval spectrum are taken as the initial population of the genetic algorithm, and the root mean square error of the partial least squares regression model is solved.
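The backward removal inside one interval can be sketched as a greedy elimination. Which variable is removed at each step is not specified in the text; the sketch assumes the one whose removal gives the lowest RMSE. The callable `rmse_of` is a stand-in for fitting a PLSR model on the given columns, and `toy_rmse` is purely illustrative.

```python
def backward_elimination(columns, rmse_of):
    """Remove one variable at a time down to one; return the best subset seen.

    columns: column numbers of one important wavelength interval.
    rmse_of: callable mapping a list of columns to a model RMSE.
    """
    current = list(columns)
    best_subset, best_rmse = current[:], rmse_of(current)
    while len(current) > 1:
        # Try dropping each remaining variable; keep the drop with the lowest RMSE.
        trials = [[c for c in current if c != drop] for drop in current]
        current = min(trials, key=rmse_of)
        err = rmse_of(current)
        if err < best_rmse:
            best_subset, best_rmse = current[:], err
    return best_subset, best_rmse

INFORMATIVE = {3, 5}  # hypothetical informative columns

def toy_rmse(cols):
    """Toy error: penalize noise columns lightly and missing informative ones heavily."""
    noise = len(set(cols) - INFORMATIVE)
    missing = len(INFORMATIVE - set(cols))
    return 1.0 + 0.1 * noise + 2.0 * missing

subset, err = backward_elimination([2, 3, 4, 5, 6], toy_rmse)
print(subset, err)
```

The subset returned for each interval becomes that interval's new, smaller important wavelength interval before the intervals are merged into the interval spectrum.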

The third embodiment is as follows: on the basis of the above embodiment, in order to more accurately reflect the contribution size of the wavelength variable to the prediction target variable, the method further includes dividing the full spectrum into s wavelength intervals by the following method:

calculating a purity row vector of the full-spectrum wavelength variable and a linear purity gradient vector of the full-spectrum wavelength variable in the horizontal direction; and dividing the full spectrum into a plurality of wavelength intervals by using the positive and negative changes of the gradient value in the linear purity gradient vector of the wavelength variable of the correction set full spectrum. The wavelength interval division method adopted by the embodiment is more scientific than the traditional division method for artificially dividing the full spectrum into N equally-spaced wavelength intervals, because the positive and negative changes of the purity gradient of the wavelength variable mean the change trend of useful information in the spectrum data, the wavelength interval is divided by the positive and negative change points of the linear purity gradient of the wavelength variable, and the wavelength interval with strong interpretability on the target variable can be more scientifically divided.

Particular embodiments employ a concentration gradient method to divide the spectral data matrix sample set into a correction set and a validation set. The purity row vector of the correction set full-spectrum wavelength variables is a row vector whose elements are the purity values of the individual wavelength variables. The purity value of a wavelength variable equals the standard deviation of the spectral data column vector generated by scanning all samples at that visible near-infrared wavelength, divided by the average value of the same column vector: $p_i = \sigma_i / \mu_i$ ($i = 1, \dots, n$), where $p_i$ is the purity of the i-th spectral wavelength variable, $\sigma_i$ is the standard deviation of all data samples at the i-th spectral wavelength, $\mu_i$ is the average value of all data samples at the i-th spectral wavelength, and n is the order of the purity row vector of the full-spectrum wavelength variables.

The magnitude of the wavelength variable purity value reflects the magnitude of the contribution of the wavelength variable to the predicted target variable.

The purity gradient row vector of the correction set full-spectrum wavelength variables is a row vector whose elements are computed from adjacent purity values in the purity row vector, taken from left to right in the horizontal direction. Each purity gradient value is calculated as follows:

the 1st element is $g_1 = p_1 - p_2$, the element in the i-th column is $g_i = (p_{i+1} - p_{i-1})$ for $2 \le i \le n-1$, and the n-th element is $g_n = p_n - p_{n-1}$, where $p_1, p_2, p_{i-1}, p_{i+1}, p_{n-1}, p_n$ are purity elements of the full-spectrum variable purity row vector $P = [p_1, p_2, \dots, p_n]$, $g_1$, $g_i$ and $g_n$ are the purity gradient elements of the 1st, i-th and n-th columns of the wavelength variable purity gradient vector, and n is the order of the full-spectrum wavelength variable purity gradient row vector. The wavelength variable purity gradient value reflects the rate of change of the spectral variable purity value.

The larger the purity gradient value of a wavelength variable, the larger the contribution of that wavelength variable to the prediction target variable, and the higher the possibility of finding a potential characteristic variable. If the purity gradient of a spectral variable is positive, the purity of the spectral variable is increasing at that wavelength point, and vice versa. If the purity gradient of a spectral variable is zero, the wavelength variable contributes little to the prediction target variable.
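The purity vector, the purity gradient and the sign-change division can be sketched together. The gradient elements follow the formulas exactly as written in the text, including the first element. The use of the sample standard deviation, the treatment of a zero gradient as "no sign change", and all function names are assumptions.

```python
import statistics

def purity_vector(X):
    """p_i = sigma_i / mu_i for each column i of the sample-by-wavelength matrix X."""
    cols = list(zip(*X))
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return [statistics.stdev(c) / statistics.mean(c) for c in cols]

def purity_gradient(p):
    """Gradient elements as written: g_1 = p_1 - p_2, g_i = p_{i+1} - p_{i-1}, g_n = p_n - p_{n-1}."""
    n = len(p)
    g = [p[0] - p[1]]
    g += [p[i + 1] - p[i - 1] for i in range(1, n - 1)]
    g.append(p[n - 1] - p[n - 2])
    return g

def split_on_sign_change(g):
    """Return inclusive (start, end) index pairs; a new interval starts at each sign flip."""
    intervals, start, prev_sign = [], 0, 0
    for i, value in enumerate(g):
        sign = (value > 0) - (value < 0)
        if sign != 0:  # assumption: a zero gradient does not start a new interval
            if prev_sign != 0 and sign != prev_sign:
                intervals.append((start, i - 1))
                start = i
            prev_sign = sign
    intervals.append((start, len(g) - 1))
    return intervals

p = [1, 2, 4, 3, 1, 2, 5]            # toy purity values
g = purity_gradient(p)               # [-1, 3, 1, -3, -1, 4, 3]
print(g, split_on_sign_change(g))
```

Each returned pair marks a run of wavelengths over which the purity is consistently rising or consistently falling, which is the unit the method then screens by VIP.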

In a fourth embodiment, the method flow is shown in fig. 2. Near-infrared spectral analysis involves many spectral variables, spectral information that easily overlaps, redundant data and a large amount of noise, which lead to long modeling times, weak model generalization capability and low prediction accuracy. To address these problems, on the basis of the above embodiment, an improved genetic algorithm is used to select characteristic wavelengths from the interval spectrum formed by combining the important wavelength intervals provided by the invention (in this embodiment the important wavelength intervals of the spectral data matrix are combined sequentially into one interval spectrum; in other embodiments other combination modes may be adopted to obtain the interval spectrum). Chromosomes use real number coding. The number of optimal characteristic wavelengths is set to a constant between 15 and 100, and the number of evolutionary generations is set to a constant between 100 and 200.

The improved genetic algorithm adopts an improved real number coding differential mutation operator, and the calculation formula is as follows:

Z(i,j)＝D×(E(r1,j)-E(r2,j))+E(i,j)，

wherein Z(i,j) represents the real number-encoded offspring value of the j-th chromosome of the i-th individual; D represents the mutation factor; E(r1,j) represents the real number-encoded parent value of the j-th chromosome of the r1-th individual, randomly selected in the population; E(r2,j) represents the real number-encoded parent value of the j-th chromosome of the r2-th individual, randomly selected in the population; E(r1,j)-E(r2,j) represents the difference between these two parent values; and E(i,j) represents the real number-encoded parent value of the j-th chromosome of the i-th individual.

The improved differential mutation operator enlarges the search space for the global optimal solution, which enables the improved genetic algorithm to find the global optimal solution with a high convergence speed.
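A minimal sketch of the differential mutation operator Z(i,j) = D×(E(r1,j)-E(r2,j))+E(i,j) is given below. Several details are assumptions, since the text does not specify them: r1 and r2 are drawn once per individual (not per chromosome position), they are allowed to coincide, and offspring values are rounded and clipped to a hypothetical range `[low, high]` so that they remain valid wavelength index numbers.

```python
import numpy as np

def differential_mutation(E, D, low, high, rng):
    """Apply Z(i,j) = D * (E(r1,j) - E(r2,j)) + E(i,j) to every individual.

    E is a (population size x chromosome length) array of index numbers.
    Assumptions: one (r1, r2) pair per individual; results are rounded
    and clipped to [low, high].
    """
    E = np.asarray(E, dtype=float)
    n = E.shape[0]
    r1 = rng.integers(0, n, size=n)   # random individual r1 for each i
    r2 = rng.integers(0, n, size=n)   # random individual r2 for each i
    Z = D * (E[r1] - E[r2]) + E
    return np.clip(np.rint(Z), low, high).astype(int)

rng = np.random.default_rng(0)
E = rng.integers(0, 50, size=(6, 4))          # 6 individuals, 4 wavelengths each
Z = differential_mutation(E, D=0.5, low=0, high=49, rng=rng)
Z_unchanged = differential_mutation(E, D=0.0, low=0, high=49, rng=rng)
```

With a mutation factor of zero the offspring equal their parents, which is a convenient sanity check for any implementation of this formula.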

In this embodiment, wavelength intervals are divided according to the positive and negative sign changes of the wavelength variable purity gradient values of the correction set visible near-infrared full spectrum. Important wavelength intervals are extracted using the criterion that the variable importance projection coefficient output by a partial least squares regression (PLSR) model is larger than 1. New important wavelength intervals are then screened within the important wavelength intervals by a backward interval PLS regression algorithm (BiPLSR), and all the new important wavelength intervals are combined into an interval spectrum. The improved genetic algorithm is applied to select, within the interval spectrum, the characteristic wavelength vector corresponding to the minimum root mean square error (RMSE) of the PLSR model as the optimal characteristic wavelength variables. This embodiment increases the probability that the improved genetic algorithm selects strongly correlated wavelength variables in the interval spectrum and reduces the probability of selecting weakly correlated wavelength variables in the visible near-infrared full spectrum, which helps eliminate collinearity and redundant data and improves the prediction accuracy of the regression model.

The fifth embodiment: to make the generation of the characteristic wavelength vector population and the acquisition of spectral matrix data simple and easy, on the basis of the above embodiment, important wavelength intervals containing wavelength variables whose importance projection coefficients are larger than a preset value are extracted from all wavelength intervals and combined sequentially into an interval spectrum. In this embodiment the important wavelength intervals of the spectral data matrix are combined sequentially into one interval spectrum; in other embodiments, other combination methods may be adopted to obtain the interval spectrum.

The specific method for sequentially combining the important wavelength intervals into an interval spectrum is as follows:

The wavelength column numbers in all the important wavelength intervals are converted into the wavelength index number row vector of the interval spectrum. The column number range of this row vector is the value range of the characteristic wavelength vector elements. Each column of data of the spectral data matrix is obtained through a mapping table between the column numbers and the interval spectrum wavelength index number row vector, and a partial least squares regression model is established, so that the generation of the characteristic wavelength vector population and the acquisition of spectral matrix data are simple and easy to implement.
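The mapping between positions in the interval spectrum and columns of the full spectral data matrix can be sketched as follows. The names `interval_index_row`, `individual` and the toy intervals are hypothetical, and zero-based, inclusive column numbers are assumed for convenience.

```python
import numpy as np

def interval_index_row(intervals):
    """Concatenate important intervals into one wavelength index row vector.

    intervals: list of (start, end) full-spectrum column numbers, inclusive.
    """
    return np.concatenate([np.arange(start, end + 1) for start, end in intervals])

# Two hypothetical important intervals merged sequentially.
index_row = interval_index_row([(10, 12), (40, 41)])

# Toy spectral matrix: 3 samples x 50 wavelengths.
X = np.arange(3 * 50, dtype=float).reshape(3, 50)

# A characteristic wavelength vector stores positions in index_row, so its
# elements only range over 0 .. len(index_row) - 1.
individual = np.array([0, 3, 4])
columns = index_row[individual]     # mapping table lookup
X_selected = X[:, columns]          # column data used to build the model
```

Because the genetic algorithm only ever manipulates positions in `index_row`, it cannot select a wavelength outside the important intervals.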

Embodiment six: on the basis of the above embodiment, each column of data of the correction set data matrix is obtained, a partial least squares regression model is established, and the reciprocal (1/RMSE) of its root mean square error (RMSE) is used as the fitness function. The fitness function F is calculated as follows:

$F = 1/\mathrm{RMSE}$, with $\mathrm{RMSE} = \sqrt{\frac{1}{n_p}\sum_{i=1}^{n_p}\left(y_i - \hat{y}_i\right)^2}$,

where $y_i$ is the reference method test value of the i-th sample of the correction set (the reference method is prior art), $\hat{y}_i$ is the predicted value of the partial least squares regression model for the wavelength variables of the i-th sample of the correction set, and $n_p$ is the number of samples in the correction set.

In the specific embodiment, the relevant parameters may also be adjusted according to the actual application, for example, by changing the chromosome and length of the genetic algorithm, or further processing the root mean square error of the partial least squares regression model of each wavelength variable of the ith sample, and accordingly adjusting the fitness function.
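The fitness evaluation F = 1/RMSE can be sketched as below. To keep the example self-contained it includes a compact single-response PLS fit (PLS1, NIPALS-style) written with NumPy; this is an illustrative stand-in, not the PLSR implementation used in the embodiment. The function names, the number of latent components and the synthetic data are all assumptions, and the RMSE is computed on the same samples used for fitting.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 and return (coefficients, column means of X, mean of y)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # nothing left to explain
            break
        w = w / norm
        t = Xk @ w                    # scores
        tt = t @ t
        p = Xk.T @ t / tt             # X loadings
        c = (yk @ t) / tt             # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w); P.append(p); q.append(c)
    if not W:
        return np.zeros(X.shape[1]), x_mean, y_mean
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean

def fitness(X, y, columns, n_components=2):
    """F = 1 / RMSE of a PLS1 model built on the selected columns."""
    Xs = X[:, columns]
    k = min(n_components, Xs.shape[1])
    coef, x_mean, y_mean = pls1_fit(Xs, y, k)
    y_hat = (Xs - x_mean) @ coef + y_mean
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    return 1.0 / rmse

# Synthetic data: only columns 0 and 3 carry information about y.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 6))
y = 2.0 * X[:, 0] - X[:, 3] + 0.01 * rng.normal(size=40)
f_informative = fitness(X, y, [0, 3])
f_uninformative = fitness(X, y, [1, 2])
```

A characteristic wavelength vector that picks the informative columns receives the higher fitness, which is the property the genetic algorithm relies on when it keeps the individual with the maximum fitness value.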

The following is experimental data for the specific embodiment shown in fig. 2:

The wavelength range of the visible near-infrared spectrum is 350-2500 nm. 193 soil samples were scanned with a visible near-infrared spectrometer at a resolution of 1 nm over the wavelength range 350 nm to 1655 nm (1306 wavelengths), generating a 193 × 1306 sample set of soil quick-acting (available) phosphorus diffuse reflectance spectral data. After the diffuse reflectance spectral data sample set was preprocessed, the concentration gradient sample division method was used to divide the 193 samples of the spectral data matrix sample set into 157 correction set samples and 36 verification set samples according to the proportion of 3:1. The statistics of the quick-acting phosphorus content reference method test values of the 193 soil samples are shown in table 1. As can be seen from table 1, the correction set and the verification set of the reference method test value data have similar standard deviation distribution characteristics, but the dispersion is large. The reference method test value is the value of the content of a substance measured by a chemical method or another method.

TABLE 1 Quick-acting phosphorus content reference method test value statistical data table of 193 soil samples

Then, the visible near-infrared full-spectrum wavelength variable purity row vector of the soil quick-acting phosphorus correction set is calculated, and a visible near-infrared full-spectrum wavelength variable purity curve of the soil quick-acting phosphorus correction set is shown in fig. 3.

Then the purity gradient row vector, in the horizontal direction, of the visible near-infrared full-spectrum wavelength variable purity row vector of the soil quick-acting phosphorus correction set is calculated; the resulting visible near-infrared full-spectrum wavelength variable purity gradient curve is shown in fig. 4. As can be seen from fig. 4, the curve can be divided into 3 peak wavelength ranges: the maximum peak lies in 800-1200 nm, the medium peak in 1200-1655 nm, and the small peak in 350-800 nm.

The full spectrum is then divided into a plurality of unequally spaced wavelength intervals according to the positive and negative sign changes of the purity gradient element values in the visible near-infrared full-spectrum variable purity gradient vector of the soil quick-acting phosphorus correction set.
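One way to read this division step is that a new interval starts wherever the sign of the purity gradient flips. The sketch below implements that reading; the function name `split_by_sign_change` is hypothetical, and treating a zero gradient as non-negative is an assumption, since the text does not say how zeros are grouped.

```python
import numpy as np

def split_by_sign_change(g):
    """Split column numbers 0..n-1 into intervals of constant gradient sign.

    Returns a list of (start, end) column numbers, both inclusive.
    Assumption: a zero gradient is grouped with positive values.
    """
    non_negative = np.asarray(g, dtype=float) >= 0
    # Positions where the sign differs from the previous element.
    breaks = np.flatnonzero(non_negative[1:] != non_negative[:-1]) + 1
    starts = np.concatenate(([0], breaks))
    ends = np.concatenate((breaks - 1, [len(non_negative) - 1]))
    return list(zip(starts.tolist(), ends.tolist()))

intervals = split_by_sign_change([1.0, 2.0, -1.0, -3.0, 4.0])
```

The resulting intervals are unequally spaced by construction, because their lengths depend only on where the gradient changes sign.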

A PLSR model is established using the visible near-infrared full-spectrum wavelength variables of the soil quick-acting phosphorus correction set, and the full-spectrum wavelength variable importance projection coefficients (VIP) are output. The VIP curve of the visible near-infrared full-spectrum PLSR of the soil quick-acting phosphorus correction set is shown in fig. 5.

VIP greater than 1 is taken as the extraction criterion for important wavelength intervals: any wavelength interval containing a wavelength variable with VIP greater than 1 is extracted as an important wavelength interval. The wavelength column numbers of all the important wavelength intervals are converted into wavelength index number row vectors, which are merged sequentially into the interval spectrum wavelength index number row vector (sequential merging is used in this embodiment; in other embodiments, other merging manners may be adopted to obtain the interval spectrum wavelength index number row vector).
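The extraction criterion can be illustrated with a short filter over precomputed VIP values. The function name `important_intervals`, the interval boundaries and the VIP numbers are made up for the example; computing the VIP values themselves from a PLSR model is outside the scope of this sketch.

```python
import numpy as np

def important_intervals(intervals, vip, threshold=1.0):
    """Keep every interval that contains at least one variable with VIP > threshold.

    intervals: list of (start, end) column numbers, both inclusive.
    vip: one VIP value per full-spectrum wavelength variable.
    """
    vip = np.asarray(vip, dtype=float)
    return [(s, e) for s, e in intervals if np.any(vip[s:e + 1] > threshold)]

vip = [0.5, 1.2, 0.3, 0.4, 0.9]               # illustrative VIP values
candidates = [(0, 1), (2, 3), (4, 4)]         # illustrative wavelength intervals
selected = important_intervals(candidates, vip)
```

Note that an interval is kept as a whole even if only one of its variables exceeds the threshold, which matches the wording "any wavelength interval containing the wavelength variable VIP greater than 1".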

Finally, the population size of the improved genetic algorithm is set to 100, the number of characteristic wavelengths to 25, and the number of evolutionary generations to 100; the column number range of the important interval spectral wavelength index number row vector is taken as the variation space of the characteristic wavelength index numbers. The corresponding column data of the soil quick-acting phosphorus correction set visible near-infrared spectral data matrix are acquired through the characteristic wavelength index numbers, a PLSR model is established, and the reciprocal of the root mean square error (RMSE) of the PLSR model is taken as the fitness function F of the characteristic wavelength vector individuals. The iterative optimization curve of the fitness function F is shown in fig. 6. The 25 optimal characteristic wavelength index numbers selected by the improved genetic algorithm are converted into 25 optimal characteristic wavelength values, as shown in table 2. The conversion formula between the wavelength index number and the wavelength value is: wavelength value = wavelength index number + 350 (nm).

TABLE 2 The 25 optimal characteristic wavelength values of the soil available phosphorus visible near-infrared spectrum selected by the improved genetic algorithm

The distribution of the 25 optimal characteristic wavelengths selected by the improved genetic algorithm is shown in fig. 7. There are 13 optimal characteristic wavelengths (855 nm, …, 1198 nm) in the 800-1200 nm range of the maximum peak of the spectral wavelength variable purity gradient, 1 optimal characteristic wavelength (1398 nm) in the 1200-1655 nm range of the medium peak, and 11 optimal characteristic wavelengths (360 nm, …, 498 nm) in the 350-800 nm range of the small peak. This distribution demonstrates the accuracy with which the visible near-infrared spectral variable gradient integrated genetic algorithm characteristic wavelength selection method divides the important interval spectrum.

The beneficial effects of this embodiment of the invention are as follows: wavelength intervals are divided using the visible near-infrared full-spectrum wavelength variable purity gradient values of the soil quick-acting phosphorus correction set, and important wavelength intervals with strong interpretability for predicting soil quick-acting phosphorus content are extracted, using the criterion that the VIP of the PLSR model is greater than 1, to form an interval spectrum. This greatly increases the probability that the improved genetic algorithm selects potential characteristic variables in the interval spectrum, simplifies the structure of the regression model, reduces the amount of calculation, and improves the prediction accuracy of soil quick-acting phosphorus content.

The implementation mode is as follows: a system for selecting characteristic wavelengths of a spectral variable gradient integrated genetic algorithm, comprising:

the partial least squares regression model establishing module is used for scanning a plurality of samples with visible near-infrared spectrum scanning equipment to generate a spectral data matrix, establishing a partial least squares regression model for the full-spectrum wavelength variables contained in the spectral data matrix, and determining the importance projection coefficients of the full-spectrum wavelength variables;

the genetic algorithm selection module is used for combining the important wavelength intervals of the spectral data matrix into an interval spectrum, taking randomly combined characteristic wavelength vectors of the interval spectrum as the initial population of the genetic algorithm, and solving the root mean square error of the partial least squares regression model; taking the reciprocal of the root mean square error of the partial least squares regression model as the fitness function of the characteristic wavelength vector, and selecting the characteristic wavelength vector with the maximum fitness value as the optimal characteristic wavelength vector; performing selection, crossover and mutation on the initial population, and replacing the original population with the resulting new individuals to form a new population; and iterating until the set number of evolutionary generations is reached, then outputting the final optimal characteristic wavelength vector.

The optimal characteristic wavelength vector is used for establishing the regression model to predict the content of the substance, so that the prediction precision of the regression model can be effectively improved, the structure of the regression model is simplified, and the generalization capability and the robustness of the regression model are better. A new method for selecting characteristic wavelength variables is provided for the design of a near infrared spectrum analyzer.

On the basis of the above embodiment, the important wavelength interval determining module is further used for, after the important wavelength intervals are obtained, applying a backward interval partial least squares regression algorithm that removes one wavelength variable at a time from each selected wavelength interval until only the last wavelength variable remains, finding the combined wavelength vector corresponding to the minimum root mean square error of the partial least squares regression model in each important wavelength interval, and constructing each new important wavelength interval from it.
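The backward elimination inside one important wavelength interval can be sketched as a greedy loop. The text does not say which variable is removed at each step, so this example assumes the variable whose removal gives the lowest RMSE is dropped; `backward_eliminate` and the toy `rmse_fn` are hypothetical, and in practice `rmse_fn` would build a partial least squares regression model on the given columns and return its RMSE.

```python
def backward_eliminate(columns, rmse_fn):
    """Remove one wavelength variable at a time until one remains.

    Returns the column combination with the lowest RMSE seen on the way.
    Assumption: at each step the variable whose removal yields the lowest
    RMSE is dropped (greedy choice).
    """
    current = list(columns)
    best_columns, best_rmse = list(current), rmse_fn(current)
    while len(current) > 1:
        # Score every candidate obtained by leaving one variable out.
        trials = [(rmse_fn(current[:k] + current[k + 1:]), k)
                  for k in range(len(current))]
        rmse, k = min(trials)
        current.pop(k)
        if rmse < best_rmse:
            best_rmse, best_columns = rmse, list(current)
    return best_columns, best_rmse

# Toy error function: columns 2 and 3 are useful, all others only add error.
useful = {2, 3}
def toy_rmse(cols):
    extra = sum(1 for c in cols if c not in useful)
    missing = sum(1 for c in useful if c not in cols)
    return extra + 10 * missing

best_columns, best_rmse = backward_eliminate([1, 2, 3, 4], toy_rmse)
```

The returned combination would then define the new important wavelength interval for that segment of the spectrum.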

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A characteristic wavelength selection method of a spectral variable gradient integrated genetic algorithm is characterized by comprising the following steps:

scanning a plurality of samples by using visible near infrared spectrum scanning equipment to generate a visible near infrared spectrum data matrix, establishing a partial least square regression model for full spectrum wavelength variables contained in the visible near infrared spectrum data matrix, and determining importance projection coefficients of the full spectrum wavelength variables;

dividing the full spectrum of the visible near infrared spectrum data matrix into a plurality of wavelength intervals, extracting from all the wavelength intervals the important wavelength intervals containing wavelength variables whose importance projection coefficients are larger than a preset value, and combining the important wavelength intervals into an interval spectrum;

taking the random combination characteristic wavelength vector of the interval spectrum as an initial population of the genetic algorithm, and solving the root mean square error of the partial least square regression model;

2. The method for selecting the characteristic wavelength of the spectral variable gradient integrated genetic algorithm according to claim 1, wherein the method comprises the following steps: after the important wavelength intervals are obtained, using a backward interval partial least squares regression algorithm to remove one wavelength variable at a time from each important wavelength interval until only the last wavelength variable remains, searching for the wavelength combination vector corresponding to the minimum root mean square error of the partial least squares regression model in each important wavelength interval, constructing each new important wavelength interval and combining the new important wavelength intervals into an interval spectrum, taking the random combination characteristic wavelength vector of the interval spectrum as an initial population of the genetic algorithm, and solving the root mean square error of the partial least squares regression model.

3. The method for selecting the characteristic wavelength of the spectral variable gradient integrated genetic algorithm according to claim 1, wherein the method comprises the following steps: the method for dividing the full spectrum into a plurality of wavelength intervals is as follows:

calculating a purity row vector of the full-spectrum wavelength variable and a purity gradient vector of the full-spectrum wavelength variable in the horizontal direction; and dividing the full spectrum into a plurality of wavelength intervals by using the positive and negative changes of the gradient value in the gradient vector of the full spectrum wavelength variable purity.

4. The method for selecting the characteristic wavelength of the spectral variable gradient integrated genetic algorithm according to claim 1, wherein the method comprises the following steps: the expression of the fitness function F of the characteristic wavelength vector is as follows:

F＝1/RMSE，

wherein RMSE is the root mean square error of the partial least squares regression model established on the column data of the full-spectrum data matrix, $y_i$ is the reference method test value of the i-th sample, $\hat{y}_i$ is the predicted value of the partial least squares regression model for the characteristic wavelength variables of the i-th sample, and $n_p$ is the number of samples.

5. The method for selecting the characteristic wavelength of the spectral variable gradient integrated genetic algorithm according to claim 1, wherein the method comprises the following steps: forming a new population according to the population size, the cross probability, the mutation probability and the selection probability of the selected genetic algorithm, wherein the mutation operator adopts a real number coding differential mutation operator, and the calculation formula is as follows:

Z(i,j)＝D×(E(r1,j)-E(r2,j))+E(i,j)，

6. The method for selecting the characteristic wavelength of the spectral variable gradient integrated genetic algorithm according to claim 1, wherein the method comprises the following steps: the method for extracting important wavelength intervals with the important projection coefficients of the wavelength variables larger than a preset value from all the wavelength intervals and combining the important wavelength intervals into an interval spectrum comprises the following steps:

converting the wavelength column numbers in all the important wavelength intervals into wavelength index number row vectors of interval spectrums; and the column number range of the wavelength index number row vector of the interval spectrum is the value range of the characteristic wavelength vector elements, and each column of data of the spectrum data matrix is obtained through a mapping table of the column number and the interval spectrum wavelength index number row vector.

7. The method for selecting the characteristic wavelength of the spectral variable gradient integrated genetic algorithm according to claim 1, wherein the method comprises the following steps: the number of the optimal characteristic wavelengths is set to a constant between 15 and 100, and the number of evolutionary generations is set to a constant between 100 and 200.

8. A system for selecting characteristic wavelengths of a spectral variable gradient integrated genetic algorithm, comprising:

the partial least square regression model establishing module is used for scanning a plurality of samples by utilizing visible near infrared spectrum scanning equipment to generate a visible near infrared spectrum data matrix, establishing a partial least square regression model for a full spectrum wavelength variable contained in the visible near infrared spectrum data matrix, and determining an importance projection coefficient of the full spectrum wavelength variable;

the genetic algorithm selection module is used for combining the important wavelength intervals of the spectral data matrix into an interval spectrum, taking randomly combined characteristic wavelength vectors of the interval spectrum as the initial population of the genetic algorithm, and solving the root mean square error of the partial least squares regression model; taking the reciprocal of the root mean square error of the partial least squares regression model as the fitness function of the characteristic wavelength vector, and selecting the characteristic wavelength vector with the maximum fitness value as the optimal characteristic wavelength vector; performing selection, crossover and mutation on the initial population, and replacing the original population with the resulting new individuals to form a new population; and iterating until the set number of evolutionary generations is reached, then outputting the final optimal characteristic wavelength vector.

9. The system of claim 8, wherein the important wavelength interval determining module is further used for, after the important wavelength intervals are obtained, applying a backward interval partial least squares regression algorithm that removes one wavelength variable at a time from each selected wavelength interval until only the last wavelength variable remains, finding the combined wavelength vector corresponding to the minimum root mean square error of the partial least squares regression model in each important wavelength interval, and constructing each new important wavelength interval.

10. The system of claim 8, wherein the spectral data matrix comprises a calibration set sample spectral data matrix, a calibration set sample reference method test value matrix, a validation set sample spectral data matrix, and a validation set sample reference method test value matrix.