
Improving Forest Above-Ground Biomass Estimation Accuracy Using Multi-Source Remote Sensing and Optimized Least Absolute Shrinkage and Selection Operator Variable Selection Method

Er Wang^1,2, Tianbao Huang^1,2, Zhi Liu^1,2, Lei Bao^1,2, Binbing Guo^1,2, Zhibo Yu^1,2, Zihang Feng^1,2, Hongbin Luo^1,2 and Guanglong Ou^1,2,*

^1 Key Laboratory of National Forestry and Grassland Administration on Biodiversity Conservation in Southwest China, Southwest Forestry University, Kunming 650233, China

^2 Key Laboratory for Forest Resources Conservation and Utilization in the Southwest Mountains of China, Ministry of Education, Southwest Forestry University, Kunming 650233, China

* Author to whom correspondence should be addressed.

Remote Sens. 2024, 16(23), 4497; https://doi.org/10.3390/rs16234497

Submission received: 30 September 2024 / Revised: 22 November 2024 / Accepted: 26 November 2024 / Published: 30 November 2024


Figure 1. Technology roadmap for this study.
Figure 2. The study area and sample plot distribution: (a) The location of Zhenyuan in Yunnan Province; (b) Six Types of Remote Sensing Imagery; (c) Remote sensing image data of Wuyi Village.
Figure 3. Data distribution for the original dataset (60 samples), training set (42 samples), and test set (18 samples).
Figure 4. Results of variable selection: (a) Boruta’s variable selection results by comparing the shaded features with the original feature evaluation; (b) Lasso regularized compression of the eigenvectors obtained from the; (c) Lasso Variable Selection Results with GA variable selection Re-used in the Lasso variable selection case; (d) Results of variable selection with correlation coefficients greater than 0.5 between remote sensing factors and forest AGBs; (e) RFIS variable importance value selection results for each remote sensing factor; (f) Lasso variable selection results in the case of removing multicollinear remote sensing factors using VIF.
Figure 5. Scatterplots of forest AGB model test set fit using 8 algorithms for 6 variable choices.
Figure 6. The results of 6 variable selection results in 8 machine learning in the test set R2 fitting results.
Figure 7. AGB inversion plot using 8 algorithms with 6 types of variable selection.


Abstract

Estimating forest above-ground biomass (AGB) from multi-source remote sensing data is an important way to improve estimation accuracy. However, when the sample size is small, selecting the remote sensing factors that effectively improve forest AGB estimation from a large pool of candidates is a challenge. The Least Absolute Shrinkage and Selection Operator (Lasso) handles large numbers of redundant variables well but still has some drawbacks. To address them, this study introduces two Lasso-based variable selection methods: Lasso with a Genetic Algorithm (Lasso-GA) and Variance Inflation Factor Lasso (VIF-Lasso). Sentinel 2, Sentinel 1, Landsat 8 OLI, ALOS-2 PALSAR-2, Light Detection and Ranging (LiDAR), and Digital Elevation Model (DEM) data were used. To explore the variable selection capabilities of Lasso-GA and VIF-Lasso for remote sensing estimation of forest AGB, the study compares them with Boruta, Random Forest Importance Selection, Pearson Correlation, and Lasso for selecting remote sensing factors. It also employs eight machine learning models, namely Random Forest (RF), Extreme Gradient Boosting (XGBoost), Support Vector Machine (SVM), Bayesian Regression Neural Network (BRNN), Elastic Net (EN), K-Nearest Neighbors (KNN), Extremely Randomized Trees (ETR), and Stochastic Gradient Boosting (SGBoost), to estimate forest AGB in Wuyi Village, Zhenyuan County. The results showed that the optimized Lasso variable selection could improve the accuracy of forest biomass estimation. The VIF-Lasso method yielded a BRNN model with an R² of 0.75 and an RMSE of 16.48 Mg/ha. The Lasso-GA method yielded an ETR model with an R² of 0.73 and an RMSE of 16.70 Mg/ha. Compared to the optimal SGBoost model with the Lasso variable selection method (R² of 0.69, RMSE of 18.63 Mg/ha), the VIF-Lasso method improves R² by 0.06 and reduces RMSE by 2.15 Mg/ha, while the Lasso-GA method improves R² by 0.04 and reduces RMSE by 1.93 Mg/ha. The results also show that the RX sample count and sensitivity provided by LiDAR, the Horizontal Transmit, Vertical Receive (HV) polarization provided by Microwave Radar, and the feature variables (Mean, Contrast, and Correlation) calculated from the Green, Red, and NIR bands of optical remote sensing in 7 × 7 and 5 × 5 windows play an important role in forest AGB estimation. Therefore, the optimized Lasso variable selection method shows strong potential for forest AGB estimation using multi-source remote sensing data.

Keywords: forest aboveground biomass; machine learning algorithm; remote sensing; Lasso optimization; small sample

1. Introduction

Forest above-ground biomass (AGB), as a qualitative and quantitative indicator of forest health, plays a crucial role in mitigating global warming [1,2]. Traditional field methods for obtaining AGB are time-consuming, labor-intensive, and can damage the ecological environment. Remote sensing offers a valuable alternative, addressing these limitations [3,4].

Remote sensing for forest AGB estimation is a popular research area, but challenges remain, including low accuracy and limited applicability, particularly in complex terrain when using space-borne sensors. It typically relies on optical images, Light Detection and Ranging (LiDAR), and Microwave Radar (MR) [5]. Optical remote sensing data are sensitive to vegetation growth and forest biomass through reflectance characteristics in the visible and near-infrared bands, and they are best suited to flat areas with simple forest types and no shadow. Lu et al. [6] used Landsat Thematic Mapper (TM) data to estimate AGB in the eastern and western parts of the Brazilian Amazon and analyzed the effect of stand structure on AGB estimation. The study found that AGB estimation remains challenging in areas with complex biophysical environments. Sarker et al. [7] investigated the relationship between biomass and texture images of secondary and mature forests in the state of Rondônia, Brazil, based on GLCM texture measures and spectral bands. The results showed that, in mature forests with complex stand structures, texture images were more strongly related to biomass than the original spectral bands, whereas the opposite held in secondary forests because of their relatively simple stand structure. While optical remote sensing offers detailed spectral information and is widely used for forest AGB estimation, it is affected by atmospheric conditions and clouds and may suffer from data saturation [8]. In addition, optical remote sensing only works during daylight hours. LiDAR data provide detailed information on the vertical structure of forests, which is particularly important for complex forest types (tropical rainforests and alpine forests). Indirabai et al. [9] validated satellite LiDAR data as an effective tool for estimating aboveground biomass, especially in complex tropical forest ecosystems, through tests in the Betul and Mudumalai forests in India. Padalia et al. [10] combined GEDI data with optical and SAR remote sensing data to estimate the AGB of managed tropical forests in the foothills of the Indian Himalayas. Spatial canopy height predictions were obtained by combining the canopy height of the GEDI footprints with Landsat 8 indices through the Random Forest (RF) method, with an R² of 0.97 and an RMSE of 2.32 m. Combining the improved GEDI data with SAR raised the R² of forest AGB estimation from 0.61 to 0.77. LiDAR data not only avoid light saturation but are also unaffected by weather conditions [11,12]. However, LiDAR data often do not cover the entire study area, and converting the sparse point measurements into continuous data is challenging, especially in dense stands where branches can scatter or absorb the laser beams [13]. MR overcomes some of these sensor limitations, with L-band and C-band SAR being suitable for aboveground biomass estimation in tropical forests, but it still faces challenges in complex terrain. In such areas, MR signals are affected by terrain fluctuations and may experience data saturation [14,15]. Englhart et al. [16] evaluated the effectiveness of X-band and L-band SAR data for estimating AGB in Kalimantan, Borneo. The results showed that ALOS PALSAR was more sensitive to AGB than TerraSAR-X in the high biomass range, while the estimation accuracy was poor in the low biomass range. The combined multi-temporal L-band and X-band model performed best in terms of accuracy, with a validation result of r² = 0.53 and an RMSE of 79 t/ha. Therefore, forest AGB estimation requires the integration of multi-source remote sensing data, as single satellite sensors are subject to natural factors and their own limitations [17].

Many scholars have studied the integration of multi-source remote sensing data for estimating forest AGB [18]. Vafaei et al. [19] estimated forest AGB using Sentinel 2A and ALOS-2 PALSAR-2 images. The R² values of the Sentinel 2A and ALOS-2 PALSAR-2 model estimates were 0.70 and 0.23, respectively, and the R² obtained by integrating the two was 0.73. David et al. [20] estimated the dynamic biomass of dryland forests in southern Africa by combining Sentinel 1 and Sentinel 2 data. Their results show that integrating Sentinel 1 and Sentinel 2 can significantly improve the estimation of forest AGB, with an R² of 0.95. Zhao et al. [21] estimated forest AGB by combining Landsat TM and ALOS PALSAR remote sensing data and showed that the variance in forest AGB estimates was reduced by integrating the two images. Chen et al. [22] estimated forest AGB in highly heterogeneous areas using multiple remote sensing data types. The study showed that highly accurate maps of forest AGB can be obtained by combining LiDAR, Landsat, and Synthetic Aperture Radar satellite data. Tamiminia et al. [23] obtained a 10 m wall-to-wall canopy height model (CHM) dataset by combining GEDI, Sentinel 1, and Sentinel 2 remotely sensed data, enhancing the estimation of forest AGB in New York State. The study showed that forest AGB estimates should be based on a combination of GEDI, Sentinel 1A/B, and Sentinel 2 data. Vafaei et al. [19], David et al. [20], and Zhao et al. [21] combined only two of the three remote sensing types, and the uncertainty of their results remains to be investigated, but they also confirmed that synergizing multi-source remote sensing data can improve forest AGB estimation. Chen et al. [22] and Tamiminia et al. [23] combined three remote sensing data sources and obtained not only forest biomass estimates in highly heterogeneous areas but also a more accurate CHM dataset. The accuracy of their forest AGB estimates might have been improved further if they had used more critical remote sensing features. In conclusion, the accuracy of forest AGB estimation can be improved by combining optical remote sensing, MR, and GEDI data, because integrating multi-source remote sensing data can overcome the one-sidedness of single-image characterization. Therefore, this study combined three remote sensing data types for estimating forest AGB to identify the more critical remote sensing factor types and further improve the accuracy of forest AGB estimation. However, although integrating multi-source remote sensing can improve estimation accuracy, it also introduces data redundancy.

The selection of effective variables is an important challenge in integrating multi-source remote sensing for forest biomass estimation [24]. The Least Absolute Shrinkage and Selection Operator (Lasso), originally proposed by Tibshirani, adds an L1 regularization term that shrinks regression coefficients to zero, thereby selecting variables and enhancing model interpretability. The penalty strength in Lasso is controlled by λ, which helps to address multicollinearity in high-dimensional data and enhances the model’s robustness [25,26]. Lasso’s variable selection capability has been used widely, both in forest AGB estimation from multi-source remote sensing and in other fields. For example, Zandler et al. [27] explored the validity of variable selection results using six empirical models for quantifying shrub biomass in arid environments. The results showed that the remote sensing factors selected by Lasso had the best modeling performance. Lazaridis et al. [28] estimated tree mortality using vegetation indices from remote sensing, with variable selection performed using ridge regression, Lasso, and partial least squares. The study showed that the variable selection method based on Lasso was the most accurate. Zhang et al. [29] found that the Lasso-based variable selection method gave higher model accuracy than the other methods in the remote sensing estimation of forest biomass using the integrated stacking algorithm. Shafiee et al. [30] predicted spring wheat yield using unmanned aircraft imagery; their analysis shows that the Lasso variable selection method is more effective than sequential forward selection. These studies show that the Lasso variable selection method is effective in identifying important features in high-dimensional datasets. Compared to other methods, Lasso not only reduces the dimensionality of the features and prevents overfitting but also has an advantage in computation time because it can quickly filter out key variables through sparse solutions. It is also robust to multicollinearity in high-dimensional data.

However, in small-sample, high-dimensional datasets, the Lasso variable selection method may select highly collinear redundant features, which can reduce the predictive power and stability of subsequent models and thereby affect model performance and interpretability [31]. The Variance Inflation Factor (VIF) provides an effective means of measuring the collinearity between features. When a feature is strongly collinear with other features, its VIF value is high, which affects the contribution of other features in variable selection [32,33]. Therefore, screening the initial feature data by VIF value can eliminate the features with high multicollinearity from the original dataset and attenuate the effect of multicollinearity on Lasso variable selection. The Genetic Algorithm (GA) can be used to deal with irrelevant features that remain in the results of Lasso variable selection. The GA identifies the optimal subset of features by generating new offspring through reproduction, crossover, and mutation and by eliminating irrelevant variables [34,35]. Therefore, to explore whether these methods can improve the accuracy of forest AGB estimation, this study applies VIF before, and GA after, Lasso variable selection to address these problems. In addition, common variable selection methods such as Boruta, Random Forest Importance Selection (RFIS), and Pearson Correlation (PC) are widely applied in remote sensing estimation of forest AGB [36]. Meanwhile, machine learning algorithms are frequently used for remote sensing estimation of forest AGB due to their robustness and high accuracy. Specifically, forest AGB estimation is extensively performed using decision trees, Random Forests (RF), Support Vector Machines (SVM), Neural Networks (NN), and other advanced algorithms within remote sensing techniques [37]. The Bidirectional Recurrent Neural Network (BRNN) captures both forward and backward relationships in time or spatial sequences through bidirectional processing, effectively mining complex nonlinear features and enhancing model accuracy [38]. Extremely Randomized Trees (ETR) integrate multiple randomized decision trees, providing robustness to noise and outliers, which makes them well suited to regression prediction tasks [39]. The Elastic Net (EN) combines L1 (Lasso) and L2 (Ridge) regularization to reduce overfitting, particularly in cases with many linearly correlated features [40]. Support Vector Machines (SVM) excel at handling high-dimensional, nonlinear problems, performing precise regression analysis through kernel tricks [41]. Random Forest (RF) constructs multiple decision trees to address complex nonlinear relationships effectively [42]. Extreme Gradient Boosting (XGBoost) is an efficient ensemble learning algorithm that optimizes model performance continuously through gradient boosting, excelling in regression tasks with complex feature relationships [43]. The K-Nearest Neighbors (KNN) algorithm makes predictions based on the similarity of local data points rather than assuming a linear model, making it effective for handling nonlinear relationships [44]. Finally, Stochastic Gradient Boosting (SGBoost) is an enhanced version of gradient boosting that introduces stochasticity during training to improve model generalization and reduce overfitting [45]. Applying these machine learning models allows a comprehensive assessment of the effectiveness of different variable selection methods. The models capture the complex relationships in the data from both linear and nonlinear aspects, and they reduce overfitting and enhance robustness on small-sample data through regularization and ensemble strategies. At the same time, these methods handle high-dimensional feature data well, which improves the accuracy and reliability of the assessment of variable selection.

In this study, Sentinel 2, Sentinel 1, Landsat 8 OLI, ALOS-2 PALSAR-2, GEDI, and Digital Elevation Model (DEM) data are integrated as data sources. The optimized Lasso methods, together with the Lasso, Boruta, Random Forest Importance Selection (RFIS), and Pearson Correlation (PC) variable selection methods, were used to screen remote sensing factors, and the selected factors were evaluated with machine learning algorithms. The study aims to explore whether the optimized Lasso variable selection method can improve the accuracy of forest AGB estimation under different machine learning models.

2. Study Area and Materials

Selecting remote sensing factors that are highly correlated with forest AGB is crucial to improving estimation accuracy when estimating forest AGB from multi-source remote sensing. In this study, the VIF-Lasso and Lasso-GA variable selection methods are proposed, and their variable selection capabilities are compared with those of four other common variable selection methods using eight machine learning methods. The technology roadmap is shown in Figure 1.

2.1. Study Area

Wuyi Village is located in the northwestern part of Enle Township, Zhenyuan County, Pu’er City, Yunnan Province (longitude 100°56′ to 101°2′E, latitude 23°57′ to 24°2′N) [46]. The study area lies in Wuliang Mountain, and the annual mean temperature is about 20.9 °C [47]. Pinus kesiya var. langbianensis is the dominant vegetation in both mountain ranges from 1200 m to 2000 m above sea level. It is a unique and widely distributed species in the region; it is not only a crucial component of the local ecosystem, providing essential habitat for various organisms, but also offers significant economic and social benefits [48]. The study area is shown in Figure 2.

2.2. Data Collection from Sample Plots and Forest AGB Calculation

A total of 60 P. kesiya var. langbianensis plots of 30 m × 30 m were sampled in March 2023 in Wuyi Village, Pu’er City, Yunnan Province. The coordinates of individual trees and sample plots were recorded using RTK, and tree diameter at breast height (1.3 m; DBH) and tree height (H) were measured and used to calculate tree biomass. Sample plot parameters are detailed in Table 1. The sample plots were located in the distribution area of P. kesiya var. langbianensis, a representative tree species in the study area, and they were minimally affected by anthropogenic disturbance. However, the rugged terrain limited the spatial distribution of the sample plots [49]. Equations (1)–(4) are the allometric biomass models for each tree species in the sample forest, taken from a study on the estimation and distribution of forest biomass and carbon storage in Yunnan Province [50]. These models have R² values of 0.9528, 0.9861, 0.9664, and 0.880, respectively. The AGB of each plot was calculated using Equation (5) and then converted to a per-hectare value using Equation (6).

P. kesiya var. langbianensis:

W_{P} = \sum_{i = 1}^{n_{1}} 0.0582 \, D_{i}^{2.1203} H_{i}^{0.4668} \quad (1)

Keteleeria fortunei:

W_{K} = \sum_{i = 1}^{n_{2}} 0.0729 \, {(D_{i}^{2} H_{i})}^{0.9334} \quad (2)

Quercus acutissima:

W_{Q} = \sum_{i = 1}^{n_{3}} 0.1663 \, {(D_{i}^{2} H_{i})}^{0.7821} \quad (3)

Broadleaf species:

W_{b} = \sum_{i = 1}^{n_{4}} 0.4531 D_{i}^{2} H_{i} - 37.07 \quad (4)

Total forest AGB of the sample plots:

W_{t} = W_{P} + W_{K} + W_{Q} + W_{b} \quad (5)

Forest AGB per hectare of the sample site:

W_{h} = \frac{W_{t}}{0.09 \times 1000} \quad (6)

where n_{1}, n_{2}, n_{3}, and n_{4} are the numbers of trees of P. kesiya var. langbianensis, Keteleeria fortunei, Quercus acutissima, and broadleaf species, respectively; D_{i} and H_{i} are the diameter at breast height (DBH) and height (H) of the i-th tree, respectively; W_{P}, W_{K}, W_{Q}, W_{b}, and W_{t} are the forest AGB of the different species and the total forest AGB in the sample plot (unit: kg); and W_{h} is the forest AGB per hectare (unit: Mg/ha). In Equation (6), 0.09 is the plot area in hectares (30 m × 30 m), and the factor of 1000 converts kg to Mg.
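The plot-level calculation can be illustrated with a minimal Python sketch. It covers Equations (1)–(3), (5), and (6) only; the broadleaf model in Equation (4) is left out. The species keys, function names, and the assumption that DBH is in cm and height in m are illustrative and not taken from the study. The conversion to Mg/ha divides by 1000 because the plot totals are stated in kg and 1 Mg = 1000 kg.

```python
PLOT_AREA_HA = 0.09  # 30 m x 30 m plot expressed in hectares


def tree_agb_kg(species, dbh, height):
    """AGB of one tree in kg, using the coefficients of Equations (1)-(3).

    The species keys are hypothetical labels; dbh is assumed to be in cm
    and height in m.
    """
    if species == "pinus_kesiya":
        return 0.0582 * dbh ** 2.1203 * height ** 0.4668
    if species == "keteleeria":
        return 0.0729 * (dbh ** 2 * height) ** 0.9334
    if species == "quercus":
        return 0.1663 * (dbh ** 2 * height) ** 0.7821
    raise ValueError(f"no allometric model for species {species!r}")


def plot_agb_mg_per_ha(trees):
    """Sum tree biomass over a plot (kg) and convert to Mg/ha."""
    total_kg = sum(tree_agb_kg(sp, d, h) for sp, d, h in trees)
    return total_kg / PLOT_AREA_HA / 1000.0


# Made-up tree list for one plot: (species, DBH, height)
example_plot = [
    ("pinus_kesiya", 22.4, 15.1),
    ("pinus_kesiya", 18.0, 13.2),
    ("quercus", 12.5, 9.0),
]
example_agb = plot_agb_mg_per_ha(example_plot)
```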

2.3. Extraction and Variable Screening of Remote Sensing Data

2.3.1. Remote Sensing Data-Acquiring

The study area is characterized by complex topography, variable climatic conditions, and ecosystems that exhibit significant spatial heterogeneity. In addition, this study focuses on coniferous forest species. A single data source may not adequately capture the structural and distributional characteristics of coniferous forest species under complex topography and variable climate, which would affect the precision of biomass estimation. Therefore, remote sensing data were collected from Sentinel 1, Sentinel 2, GEDI, ALOS-2 PALSAR-2, and Landsat 8 OLI, together with a DEM. The data products obtained include IW SLC, TIRSCI LEVEL-1, L2B, PALSAR-2, and TIRSCI LEVEL-1, as detailed in Table 2.

2.3.2. Pre-Processing

The Landsat 8 OLI data were preprocessed with ENVI version 5.6 for atmospheric, radiometric, and terrain correction. Sentinel 2 data were atmospherically corrected using Sen2Cor v2.5.5 to convert MSIL1C to MSIL2A. The GEDI data were processed using Python version 3.7, converted to vector format, and then interpolated by kriging in ArcGIS version 10.8 to cover the entire study area. Sentinel 1 data were acquired in a south-to-north direction in dual polarization mode (3 February 2023). The data were delivered in SAFE format and converted from the IW SLC level to the GRD level, and pre-processing comprised orbit correction, thermal noise removal, radiometric calibration, multi-looking, speckle filtering, terrain correction, and decibel conversion. The ALOS-2 PALSAR-2 data were processed in SNAP, including radiometric calibration, multi-looking, speckle noise filtering, decibel conversion, and Keplerian terrain correction, which gave the backscattered intensities in decibels for the HH, HV, VH, and VV fully polarized features. A DEM with a spatial resolution of 30 m × 30 m from the onboard sensors was used for terrain correction of Sentinel 1, Sentinel 2, Landsat 8 OLI, and ALOS-2 PALSAR-2 data.

For Sentinel 1, the VV and VH polarization features and 48 Gray Level Co-occurrence Matrix (GLCM) texture variables (with window sizes of 3 × 3, 5 × 5, and 7 × 7) were used as remote sensing feature variables. The Sentinel 2 data were processed in ENVI version 5.6 to extract 10 spectral bands, 18 vegetation indices, and 268 texture variables (3 × 3, 5 × 5, and 7 × 7 GLCM). From the Landsat 8 OLI data, 7 spectral bands, 22 vegetation indices, and 168 texture variables (3 × 3, 5 × 5, and 7 × 7 GLCM) were extracted. From the ALOS-2 PALSAR-2 images, the HH, HV, VH, and VV full-polarization features, 10 ratio values, and 24 HV texture variables (3 × 3, 5 × 5, and 7 × 7 GLCM) were extracted. In total, 39 GEDI remote sensing feature variables were extracted using Python 3.7. The remote sensing characteristic factors are shown in Table 3 [12,38,51,52,53].

2.3.3. Variable Selection Methods

(1): Boruta

Boruta is a feature selection method based on random forests. It creates a set of shadow features by shuffling the original features, combines the original and shadow features into a new dataset, and evaluates their importance to retain the remote sensing factors that contribute more than the shadow features [54].
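The shadow-feature idea can be sketched in a few lines of Python. This is a single simplified round, not the full Boruta procedure (which repeats the comparison many times and applies a statistical test), and it is not the implementation used in the study. The function name, the number of trees, and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def shadow_feature_screen(X, y, feature_names, random_state=0):
    """One Boruta-style round: keep features that beat the best shadow."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X, dtype=float)
    n_features = X.shape[1]
    # Shadow features: every column shuffled independently, which breaks
    # any relationship between that column and the target.
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(n_features)]
    )
    combined = np.hstack([X, shadows])
    forest = RandomForestRegressor(n_estimators=300, random_state=random_state)
    forest.fit(combined, y)
    importances = forest.feature_importances_
    best_shadow = importances[n_features:].max()
    return [
        feature_names[j]
        for j in range(n_features)
        if importances[j] > best_shadow
    ]


# Synthetic demo data (not the study's data): only f0 and f1 drive the target.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(60, 8))
y_demo = 4.0 * X_demo[:, 0] + 2.0 * X_demo[:, 1] + demo_rng.normal(scale=0.3, size=60)
names_demo = [f"f{j}" for j in range(8)]
kept = shadow_feature_screen(X_demo, y_demo, names_demo)
```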

(2): Least Absolute Shrinkage and Selection Operator

Lasso variable selection helps to reduce model complexity and prevent overfitting through regularization [55]. It is implemented in Python using scikit-learn’s LassoCV class, and the best λ is selected automatically using 5-fold cross-validation, which balances the complexity and accuracy of the model.
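A minimal sketch of this step with LassoCV is given below. scikit-learn calls the penalty strength `alpha`, which plays the role of λ here. Standardizing the features before fitting, the helper name `lasso_select`, and the synthetic data are assumptions for illustration and are not described in the study.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler


def lasso_select(X, y, feature_names, cv=5, random_state=0):
    """Fit LassoCV and return the features with non-zero coefficients."""
    # Standardize so the L1 penalty treats all features on the same scale
    # (an assumption; the study does not state its scaling).
    X_std = StandardScaler().fit_transform(X)
    model = LassoCV(cv=cv, random_state=random_state).fit(X_std, y)
    keep = np.flatnonzero(model.coef_ != 0)
    return [feature_names[i] for i in keep], model.alpha_


# Synthetic demo data (not the study's data): 60 samples, 30 candidate
# features, of which only f0 and f1 are related to the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
y = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=60)
names = [f"f{i}" for i in range(30)]
selected, best_lambda = lasso_select(X, y, names)
```

The features whose coefficients are shrunk exactly to zero are dropped, so `selected` is the retained subset and `best_lambda` is the penalty chosen by cross-validation.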

(3): Random Forest Importance Selection

The RFIS can be used for classification and regression [56]. It builds multiple decision trees and aggregates the importance score of each feature across the trained trees. The greater a feature’s contribution to the model, the higher its importance score. Feature selection can enhance model robustness by reducing overfitting and focusing on the most relevant variables. In this study, feature selection is implemented using the random forest algorithm in Python’s scikit-learn (sklearn.ensemble). The top 0.5% of features were selected based on their importance scores. This Python module is widely used and integrates a number of state-of-the-art machine learning algorithms [57].
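As a rough illustration, importance-based ranking with scikit-learn’s RandomForestRegressor might look like the sketch below. The cut-off is passed as a count `top_k`, which is a hypothetical parameter chosen for simplicity; the study states its own cut-off in terms of importance scores. The helper name and the synthetic data are also assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def rf_importance_select(X, y, feature_names, top_k, random_state=0):
    """Rank features by impurity-based importance and keep the top_k."""
    forest = RandomForestRegressor(n_estimators=200, random_state=random_state)
    forest.fit(X, y)
    # argsort is ascending, so reverse it to get the most important first.
    order = np.argsort(forest.feature_importances_)[::-1]
    return [feature_names[i] for i in order[:top_k]]


# Synthetic demo data (not the study's data): f0 dominates the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 10))
y = 10.0 * X[:, 0] + rng.normal(scale=0.5, size=60)
names = [f"f{i}" for i in range(10)]
top_features = rf_importance_select(X, y, names, top_k=3)
```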

(4): Pearson Correlation

The Pearson correlation (PC) coefficient measures the strength and direction of the linear relationship between two variables, and a threshold on the coefficient is used to decide which variables are retained [58]. In this experiment, the threshold was set at 0.5. The formula is as follows.

r = \frac{\sum_{\gamma} (X_{\gamma} - \bar{X}) (Y_{\gamma} - \bar{Y})}{\sqrt{\sum_{\gamma} {(X_{\gamma} - \bar{X})}^{2} \sum_{\gamma} {(Y_{\gamma} - \bar{Y})}^{2}}}

where X_{\gamma} and Y_{\gamma} denote the \gamma-th observations of the two variables X and Y, respectively; \bar{X} and \bar{Y} are the means of X and Y, respectively; and r is the Pearson correlation coefficient, which ranges from −1 to 1.
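The formula and the 0.5 threshold can be written out directly in plain Python. In this sketch the threshold is applied to the absolute value of r, so that strongly negative correlations are also kept; that choice, the function names, and the toy data are assumptions for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


def select_by_correlation(features, agb, threshold=0.5):
    """Keep the features whose |r| with AGB exceeds the threshold."""
    return [
        name
        for name, values in features.items()
        if abs(pearson_r(values, agb)) > threshold
    ]


# Toy data (not the study's data)
toy_agb = [10, 20, 30, 40]
toy_features = {
    "up": [1, 2, 3, 4],      # perfectly positively correlated
    "down": [4, 3, 2, 1],    # perfectly negatively correlated
    "flat": [1, -1, 1, -1],  # weakly correlated
}
toy_selected = select_by_correlation(toy_features, toy_agb)
```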

(5): VIF-Lasso

VIF is a measure of the degree of multicollinearity between the independent variables [59]. The VIF value indicates the collinearity between a particular independent variable and the other independent variables and is calculated as follows:

\mathrm{VIF}(X_{j}) = \frac{1}{1 - R_{j}^{2}}

where R_{j}^{2} is the coefficient of determination obtained by regressing the j-th variable X_{j}, treated as the dependent variable, on all remaining independent variables; the larger the VIF value, the stronger the correlation between that independent variable and the other independent variables.

VIF < 10: the variable is usually considered to be free of severe multicollinearity. VIF ≥ 10: there is a significant collinearity problem between the variables, and the variable is removed [60]. However, VIF is only able to identify highly correlated variables and does not further optimize variable selection. The original feature dataset contained 593 variables, with 509 retained after VIF screening (threshold set at 10). Lasso regression, by contrast, has a strong feature selection ability. Used in conjunction with VIF, Lasso not only handles the independent variables that remain after VIF removal but also eliminates features that do not significantly affect the target variable. Therefore, using VIF to eliminate the effects of collinearity, combined with Lasso’s feature selection, should effectively reduce redundant features.
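The VIF screening step can be sketched with NumPy as follows. The study does not describe how the screening was carried out in detail, so the iterative rule used here (repeatedly drop the variable with the largest VIF until all values fall below the threshold) is an assumption, as are the function names and the synthetic data. Ordinary least-squares VIF also needs more samples than variables, so this is a generic sketch and not a reproduction of the 593-variable screening. The retained columns would then be passed on to Lasso.

```python
import numpy as np


def vif_values(X):
    """VIF of every column: 1 / (1 - R_j^2), regressing column j on the rest."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    vifs = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coef
        ss_res = float(residual @ residual)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        vifs[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return vifs


def drop_high_vif(X, names, threshold=10.0):
    """Drop the worst variable until every VIF is below the threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        vifs = vif_values(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] < threshold:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return X, names


# Synthetic demo data: c is almost a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = a + 0.01 * rng.normal(size=100)
X_kept, names_kept = drop_high_vif(np.column_stack([a, b, c]), ["a", "b", "c"])
```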

(6): Lasso-GA

Lasso regression performs feature selection by L1 regularization, but the best features may not be efficiently filtered when multicollinearity is present [61]. To solve this problem, GA was introduced to optimize the feature selection process for Lasso regression. First, Lasso regression was used to screen out an initial subset of features as the population initialization for the genetic algorithm. Next, the GA continuously evolved the feature subset through crossover and mutation operations and optimized the fitness by evaluating the subset’s performance in the Lasso regression model. After multiple generations of iterations, the GA selected the best subset of features [62]. The experiments were performed in Python version 3.7. The GA has a population size of 50, an iteration count of 40, a crossover rate of 0.5, and a mutation rate of 0.2. The feature selection method, which combines Lasso and the genetic algorithm, effectively reduces redundant variables and overcomes the local optimization problem of Lasso.
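The genetic search over feature subsets can be sketched in plain Python using the reported settings (population 50, 40 generations, crossover rate 0.5, mutation rate 0.2). Everything else is an assumption made for illustration: the binary-mask encoding, single-point crossover, selection from the better half of the population, keeping the best individual each generation, and the function names. The fitness function is left pluggable; in the setting described above it would score a subset of the Lasso-selected features by its performance in a Lasso model, whereas the demo uses a toy fitness as a stand-in.

```python
import random


def genetic_feature_search(n_features, fitness, pop_size=50, generations=40,
                           crossover_rate=0.5, mutation_rate=0.2, seed=0):
    """Search binary feature masks (n_features >= 2) that maximize fitness."""
    rng = random.Random(seed)

    def ensure_non_empty(mask):
        if not any(mask):
            mask[rng.randrange(n_features)] = True
        return mask

    population = [
        ensure_non_empty([rng.random() < 0.5 for _ in range(n_features)])
        for _ in range(pop_size)
    ]
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        children = [best[:]]  # carry the best mask over unchanged
        while len(children) < pop_size:
            mother, father = rng.sample(parents, 2)
            child = mother[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = mother[:cut] + father[cut:]
            if rng.random() < mutation_rate:
                flip = rng.randrange(n_features)  # flip one random bit
                child[flip] = not child[flip]
            children.append(ensure_non_empty(child))
        population = children
        best = max(population, key=fitness)
    return best


def toy_fitness(mask):
    """Stand-in fitness: rewards features 0 and 3, penalizes subset size."""
    return 2.0 * (mask[0] + mask[3]) - 0.5 * sum(mask)


best_mask = genetic_feature_search(8, toy_fitness)
```

Because the best mask is copied into every new generation, the best fitness never decreases from one generation to the next.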

2.3.4. Model Construction

This study models forest AGB from remote sensing data in Python 3.7 using eight machine learning algorithms. The dataset is partitioned into a 70% training set and a 30% test set to ensure independent validation. To make the partition representative, the test samples were drawn according to the distribution of the data: 7 samples from each of the 0–100 Mg/ha and 100–150 Mg/ha intervals and 4 samples from the interval above 150 Mg/ha, for a total of 18 randomly selected test samples. This spreads the test set evenly over the AGB range and minimizes the bias in model evaluation caused by uneven sample distribution, as shown in Figure 3 and Table 4. RF serves as a robust ensemble method in this context; the Random Forest Regressor constructs multiple decision trees to enhance model performance. XGBoost is a gradient boosting algorithm that incorporates regularization to prevent overfitting, and it is applied through XGBRegressor for efficient data processing. SVM, a well-established supervised learning algorithm, addresses both linear and nonlinear problems. In this study, sklearn.svm.SVR is used to identify the optimal regression hyperplane within a high-dimensional space, minimizing prediction error, which is particularly useful for high-dimensional feature data and offers strong generalization. BRNN, which consists of two independent networks for forward and backward processing, is also explored to model complex dependencies within the data. Model generalization is improved by fine-tuning parameters such as epochs, batch_size, and validation_split. EN is a linear model that optimizes predictive performance through combined L1 and L2 regularization. To further enhance EN’s estimation capabilities, its parameters are optimized via grid search.
KNN, a straightforward yet effective method for classification and regression, predicts outcomes based on sample similarity. An automated search over K = 1–20 determines the optimal K-value in this study, refining the model’s accuracy [44]. SGBoost, an enhanced gradient boosting method, introduces randomness during training with GradientBoostingRegressor, which improves generalization and reduces the risk of overfitting. ETR, similar to Random Forest (RF) but with a more randomized node-splitting approach, is implemented using sklearn.ensemble.ExtraTreesRegressor with default parameters to enhance computational efficiency. Employing these eight machine learning algorithms allows a comprehensive assessment of the variable selection results, providing a rigorous evaluation of whether VIF-Lasso and Lasso-GA outperform the alternative variable selection methods in filtering efficiency. This comparison validates the predictive effectiveness of the selected variables across diverse model structures.
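
The interval-based test split described in this section can be illustrated with a short sketch. The function name, the half-open interval convention (a plot at exactly 100 Mg/ha falls in the 100–150 interval), and the synthetic AGB values are assumptions made for illustration; only the quotas (7, 7, and 4 test plots) and the 60-plot total come from the text.

```python
import random


def stratified_test_split(agb, quotas, seed=0):
    """Draw a fixed number of test plots from each AGB interval.

    quotas is a list of (low, high, count); a plot belongs to an interval
    when low <= agb < high. Returns (train_indices, test_indices).
    """
    rng = random.Random(seed)
    test_idx = []
    for low, high, count in quotas:
        members = [i for i, value in enumerate(agb) if low <= value < high]
        if len(members) < count:
            raise ValueError(f"only {len(members)} plots in [{low}, {high})")
        test_idx.extend(rng.sample(members, count))
    test_set = set(test_idx)
    train_idx = [i for i in range(len(agb)) if i not in test_set]
    return train_idx, sorted(test_idx)


# Quotas stated in the text: 7 plots in 0-100, 7 in 100-150, 4 above 150 Mg/ha.
QUOTAS = [(0, 100, 7), (100, 150, 7), (150, float("inf"), 4)]

# Hypothetical AGB values for 60 plots, evenly spaced for the demo only.
agb_demo = [20 + 3 * i for i in range(60)]
train_idx, test_idx = stratified_test_split(agb_demo, QUOTAS)
print(len(train_idx), len(test_idx))  # 42 18
```

Fixing the count per interval, rather than sampling 30% at random, guarantees that the sparsely populated high-biomass interval is represented in the test set.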

2.3.5. Model Evaluation

The coefficient of determination (R²) and root mean square error (RMSE) were calculated on the independent test samples and used to evaluate the models. The formula for R² is given in Equation (7) and that for RMSE in Equation (8).

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(7)

RMSE = \sqrt{\frac{\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})^{2}}{n}}

(8)

where n is the number of sample observations, y_{i} is the actual AGB value, \hat{y}_{i} is the estimate, and \bar{y} is the average AGB of the observed samples.
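
As a worked illustration of Equations (7) and (8), the two metrics can be computed directly from paired observed and estimated values. The function names and the four sample values below are hypothetical and are not taken from the study's plots.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination, Equation (7)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error, Equation (8), in the units of y."""
    n = len(y_true)
    return math.sqrt(sum((p - y) ** 2 for y, p in zip(y_true, y_pred)) / n)


observed = [100.0, 120.0, 140.0, 160.0]   # hypothetical plot AGB, Mg/ha
predicted = [110.0, 115.0, 150.0, 155.0]  # hypothetical model output, Mg/ha
print(r_squared(observed, predicted))  # 0.875
print(rmse(observed, predicted))       # about 7.91
```

Here the squared residuals sum to 250 and the total sum of squares is 2000, so R² = 1 - 250/2000 = 0.875 and RMSE = sqrt(250/4).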

3. Analysis of Results

3.1. Forest AGB and Remote Sensing Factor Weights in Different Variable Selection Methods

The remote sensing factors retained by the six variable selection methods are shown in Figure 4. Because each method weights the remote sensing factors differently, both the number and the types of selected factors vary. VIF-Lasso and Lasso selected the same number of factors (18), and S2_B4_ME_5 performed well in both, but the types of factors selected were not identical. Lasso-GA retained only eight remote sensing factors, yet these covered all three data types: optical remote sensing, LiDAR, and MR. In addition, Boruta, RFIS, and PC selected 11, 29, and 24 remote sensing factors, respectively. Overall, texture factors were the most important among the optical remote sensing factors (B3, B4, B5, B11, B12), sensitivity and rv_a2 were the most important LiDAR factors, and the Correlation texture of HV in 3 × 3 and 5 × 5 windows contributed most among the MR factors.

3.2. Comparison of Model Fitting of 6 Variable Selection Methods in a Test Set of 8 Machine Learning Methods

In this study, Lasso, Lasso-GA, VIF-Lasso, Boruta, RFIS, and PC were each combined with RF, XGBoost, SVM, BRNN, EN, KNN, ETR, and SGBoost for forest AGB estimation. The models were evaluated on independent samples using R² and RMSE, and the test set (30%) results are shown in Figure 5 and Figure 6. With Boruta variable selection, the best fits were obtained with RF (R² = 0.55, RMSE = 22.4 Mg/ha) and SVM (R² = 0.55, RMSE = 22.31 Mg/ha). RFIS (R² = 0.62, RMSE = 20.55 Mg/ha) and PC (R² = 0.60, RMSE = 21.50 Mg/ha) both fitted best with BRNN. Lasso variable selection fitted best with SGBoost (R² = 0.69, RMSE = 18.63 Mg/ha), exceeding the RFIS and PC results with BRNN. The Lasso variable selection method outperformed Boruta, PC, and RFIS across all eight machine learning methods. Lasso-GA fitted best with the ETR model (R² = 0.73, RMSE = 16.70 Mg/ha), which was superior to the SGBoost model with Lasso, with an increase of 0.04 in R² and a decrease of 1.93 Mg/ha in RMSE. In addition, SGBoost (R² = 0.72, RMSE = 18.35 Mg/ha) and KNN (R² = 0.70, RMSE = 18.78 Mg/ha) also fitted well under Lasso-GA variable selection. BRNN with VIF-Lasso variable selection gave the best fit overall (R² = 0.75, RMSE = 16.48 Mg/ha), with an increase of 0.06 in R² and a decrease of 2.15 Mg/ha in RMSE compared to the SGBoost model with Lasso variable selection. SGBoost also performed well under VIF-Lasso variable selection, with an R² of 0.74 and an RMSE of 16.94 Mg/ha. The EN and SVM models fitted better when forest AGB exceeded 150 Mg/ha, whereas the BRNN, ETR, KNN, RF, and SGBoost models fitted better when forest AGB was below 150 Mg/ha. Notably, all models except SVM showed strong robustness in forest AGB estimation.
In summary, the models are highly adaptable and stable across different datasets and variable selection methods. The overall performance of the models ranks as follows: BRNN > SGBoost > ETR > XGBoost > KNN > RF > EN > SVM. The Lasso-GA and VIF-Lasso variable selection methods have significant advantages in improving the accuracy of forest AGB estimates: by retaining the features most relevant to the target variable, they remove noisy features and improve model performance.
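
The gains quoted in this section follow from simple differences between the reported best results. The short check below uses only the R² and RMSE values stated in the text; the dictionary layout and function name are illustrative.

```python
# Best test-set results reported in the text: (R^2, RMSE in Mg/ha).
best = {
    "Lasso + SGBoost": (0.69, 18.63),
    "Lasso-GA + ETR": (0.73, 16.70),
    "VIF-Lasso + BRNN": (0.75, 16.48),
}


def gain_over(baseline, candidate):
    """Return (increase in R^2, decrease in RMSE), rounded to two decimals."""
    r2_base, rmse_base = best[baseline]
    r2_cand, rmse_cand = best[candidate]
    return round(r2_cand - r2_base, 2), round(rmse_base - rmse_cand, 2)


print(gain_over("Lasso + SGBoost", "Lasso-GA + ETR"))    # (0.04, 1.93)
print(gain_over("Lasso + SGBoost", "VIF-Lasso + BRNN"))  # (0.06, 2.15)
```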

3.3. Forest AGB Inversion Estimation

Figure 7 maps the forest AGB of the study area as inverted with the six variable selection methods and eight machine learning methods. The inversion results showed that the KNN models built on the Lasso and PC variable selection methods fitted poorly, and the deviation of the RMSE estimates ranged from 100 to 130 Mg/ha (Figure 5). In contrast, the KNN model fitted better with Lasso-GA and VIF-Lasso, with RMSE values ranging from 0 Mg/ha to more than 160 Mg/ha compared to the Lasso variable selection results. In addition, the same machine learning algorithm showed little variability across the different variable selection methods, maintaining a consistent distribution of inversion results. However, the estimation results show large differences between the variable selection methods used to filter the different remotely sensed data. The variable selection method is therefore critical to the results of forest AGB estimation, and forest AGB can be estimated more accurately by incorporating different types of remote sensing factors in the variable selection process.

4. Discussion

4.1. Selection Variables of the AGB Model

Selecting appropriate remote sensing factors is essential to improving the accuracy of forest AGB estimates [4,63]. Six variable selection methods were used in this study to synthesize multi-source remote sensing data. Lasso variable selection retained fewer LiDAR remote sensing factors, which may be because LiDAR factors are excluded when they are highly correlated with the other two types of remote sensing factors [64]. The differences in number and type between the VIF-Lasso and Lasso variable selection results were relatively small, but we used VIF to remove remote sensing factors with multicollinearity and retain the contributing, non-redundant factors, improving the explanatory power of the model [62]. The GA performs a global search for the optimal features by simulating evolutionary processes such as genetic selection, crossover, and mutation, and it searches further for the best combination of remote sensing factors on the basis of the Lasso screening [65]. PC variable selection retained no ALOS-2 PALSAR-2 remote sensing factors, and only remote sensing factors highly correlated with forest AGB were selected [66]. The results of the Boruta variable screening were biased in favor of MR and GEDI because the high correlation among the optical remote sensing factors affected Boruta’s assessment of them [67]. RFIS variable selection ranks the higher-scoring remote sensing factors first, but we could not determine an appropriate cutoff for the number of ranked factors, so this variable selection method retained a larger number of remote sensing factors [68].

4.2. Comparison of Variable Selection Methods

The accuracy of forest AGB estimation is affected by sensor type, variable selection method, and model fitting [69,70]. Therefore, to ensure estimation accuracy, the remote sensing factors in this study were screened using Lasso-optimized variable selection methods. The screening results of VIF-Lasso and Lasso-GA were compared with those of the Lasso, PC, RFIS, and Boruta methods across eight fitting models. The results showed that the VIF-Lasso and Lasso-GA variable selection methods outperformed the other four methods across the eight models. PC ignores nonlinear relationships and only identifies remote sensing factors that are linearly correlated with forest AGB [71]. Lasso can handle high correlation among the features of large datasets, but it becomes unstable when the sample size is smaller than the feature set. Boruta, on the other hand, has high computational costs [72]. In addition, RFIS may overemphasize one variable at the expense of others, complicating the determination of the most appropriate features when variables are highly correlated with each other [73]. Nevertheless, the optimal R² achieved with Lasso variable selection (SGBoost, R² = 0.69) exceeded those of PC (BRNN, R² = 0.54), RFIS (BRNN, R² = 0.62), and Boruta (SVM, R² = 0.55). This confirmed the effectiveness of Lasso in a multi-source data context, although the R² of the fitted model remained low due to limitations in variable selection. VIF-Lasso and Lasso-GA improve R² relative to Lasso by optimizing variable selection and by handling multi-source data better. In this study, 84 highly correlated feature variables with VIF values greater than 10 were excluded, and 509 feature variables were retained. The VIF method is effective in eliminating multicollinearity between variables but is unable to assess the correlation between the features and the target variable [74,75].
Lasso can handle multicollinearity and automatically select important features, but its selection degrades when features in high-dimensional data are strongly correlated. For this reason, the VIF and Lasso methods were combined: variables with VIF less than 10 were further screened using Lasso. The results showed that the BRNN model built with the features screened by VIF-Lasso performed best (R² = 0.75), clearly outperforming the Lasso-only model. This shows that the VIF-Lasso method is effective in solving the multicollinearity problem and in selecting the features most closely related to the target variable. In addition, a better R² is difficult to obtain with Lasso alone because of its limitations, and GA shows advantages in this regard. GA can identify the best subset within a multi-feature set, but it struggles with multiple datasets. It is computationally intensive and slow to converge, but it captures nonlinear features and does not rely on linearity [76]. Ji et al. [77] and Örkcü et al. [78] showed experimentally that optimizing a model with GA yields better feature vectors and improves estimation accuracy. In this study, the Lasso variable selection results were optimized using GA to obtain a new feature set for model fitting. The highest R² with Lasso-GA was 0.73 for the ETR model, which is 0.04 higher than the best R² obtained with the SGBoost model using Lasso variable selection. However, despite the improved estimation of forest AGB with nonlinear Lasso-GA, its R² is still lower than that of the best linear VIF-Lasso model. This reflects an inherent limitation of the Lasso selection results: using GA to re-extract variables from them improved the fitting accuracy, but some important variables had already been lost during the initial screening [79]. Despite this, nonlinear Lasso-GA variable selection outperforms linear VIF-Lasso in models such as XGBoost, RF, ETR, and KNN.
Linear variable selection methods are simpler, easier to interpret, and more stable, while nonlinear methods are more complex, harder to interpret, and require more computational time [80]. Selecting an appropriate method is therefore crucial to obtaining a better estimate of forest AGB [81].
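
The VIF screening step that precedes Lasso in VIF-Lasso can be sketched with NumPy alone. This is an illustration under stated assumptions rather than the study's procedure: the text gives only the threshold of 10, so the iterative removal of the worst column, the function names, and the synthetic data are assumptions. The sketch also assumes more samples than features and no constant columns, since otherwise the auxiliary regressions are degenerate.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each column of X (n_samples x n_features)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = resid @ resid
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def drop_high_vif(X, threshold=10.0):
    """Remove the column with the largest VIF until all VIFs are <= threshold."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        scores = vif(X[:, keep])
        worst = int(np.argmax(scores))
        if scores[worst] <= threshold:
            break
        keep.pop(worst)
    return keep  # indices of the columns that would be passed on to Lasso


# Synthetic demo: x2 is almost an exact linear combination of x0 and x1.
rng = np.random.default_rng(0)
x0, x1, x3 = rng.normal(size=(3, 200))
x2 = x0 + x1 + rng.normal(scale=0.01, size=200)
X_demo = np.column_stack([x0, x1, x2, x3])
kept = drop_high_vif(X_demo, threshold=10.0)
print(kept)
```

Removing one column at a time and recomputing matters because dropping a single member of a collinear group usually lowers the VIF of the others, so a single-pass cut would tend to discard more variables than necessary.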

The accuracy of forest AGB estimation is also affected by the machine learning algorithm [82,83]. In this study, the performance of SVM and RF was relatively poor, while the robustness and fitting performance of XGBoost, BRNN, ETR, EN, SGBoost, and KNN were relatively good. The estimation results of machine learning algorithms are affected by factors such as model complexity, parameter selection, optimization, feature variable handling, and noise handling [84,85]. VIF-Lasso performs best with BRNN because the Bayesian regularization of BRNN can effectively deal with noisy datasets and optimize parameter tuning [38]. BRNN utilizes bi-directional information flow and Bayesian regularization, which allows it to capture complex nonlinear relationships between features and target variables and to extract valuable insights even from limited samples of time-series or dependent data [86]. Despite the complexity of BRNN models, their Bayesian regularization helps to reduce the risk of overfitting and enhances the generalization ability of the model. With parameters such as epochs and batch size adjusted, BRNN still performs well with limited data. The ETR model reached an R² of up to 0.73 with Lasso-GA. ETR selects split points randomly, which reduces variance, improves parameter adjustment, and enhances estimation accuracy [87]. It mitigates variance and minimizes the risk of overfitting with small samples by averaging results from multiple decision trees. Its automatic feature importance evaluation and input optimization confer robustness to outliers and noise, ensuring stable predictive performance [88]. Compared to deep learning models, ETR is less sensitive to hyperparameters and requires minimal tuning. RF manages small datasets efficiently, and its ensemble structure makes it powerful enough to capture complex relationships even when data availability is limited.
XGBoost incorporates regularization to prevent overfitting and can capture nonlinear relationships with minimal data, remaining effective in small-sample scenarios. SVM operates efficiently in high-dimensional spaces, using regression hyperplanes to minimize prediction error and provide strong generalization capabilities through well-tuned parameters. EN combines L1 and L2 regularization with grid search for parameter optimization, improving prediction accuracy and preventing overfitting in small-sample settings. KNN predicts from sample similarity and shows high accuracy in small-sample settings. By automatically searching for the optimal K-value, KNN demonstrates feasibility and stability in such scenarios. SGBoost utilizes randomness in training to improve generalization and reduce overfitting, enabling high prediction accuracy and robustness in small datasets. Overall, combining these eight machine learning algorithms with appropriate feature selection and parameter optimization strategies demonstrates their effectiveness in forest AGB modeling and prediction despite the small sample size. With appropriate tuning, these models are highly feasible for small-sample applications and provide reliable estimates and robust predictions across different model structures.
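
The automated K-value search mentioned for KNN can be sketched with scikit-learn's grid search. The text states only that K was searched over 1 to 20; the 5-fold cross-validation, the RMSE scoring, the function name, and the synthetic data below are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor


def tune_knn(X_train, y_train, k_values=range(1, 21), cv=5):
    """Grid-search the number of neighbours for a KNN regressor."""
    search = GridSearchCV(
        KNeighborsRegressor(),
        param_grid={"n_neighbors": list(k_values)},
        scoring="neg_root_mean_squared_error",  # higher (closer to 0) is better
        cv=cv,
    )
    return search.fit(X_train, y_train)


# Synthetic training set of 42 samples (the training-set size in this study).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(42, 5))
y_demo = 100 + 20 * X_demo[:, 0] + rng.normal(scale=5, size=42)

search = tune_knn(X_demo, y_demo)
best_k = search.best_params_["n_neighbors"]
print(best_k, -search.best_score_)  # chosen K and its cross-validated RMSE
```

With 42 training samples and 5 folds, each training fold holds at least 33 samples, so every candidate K up to 20 remains valid.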

Compared with other studies, the Lasso-GA and VIF-Lasso variable selection methods proposed in this study have clear advantages in combining multi-source remote sensing to estimate forest AGB. They help to address the low estimation accuracy that results from using too few remote sensing data sources or from choosing inappropriate variable selection and machine learning methods. For example, De Almeida et al. [89] used geo-referenced inventory data from 132 sample plots to obtain a reference field AGB, calculated 333 metrics, and applied a feature selection procedure, obtaining an R² of 0.70 for forest AGB. Zhao et al. [90] applied the Boruta variable selection method to 232 remote sensing factors to estimate Robinia pseudoacacia AGB, obtaining an R² of 0.66. Zhang et al. [91] proposed a stability-heterogeneity-correlation-based ensemble method for AGB estimation, which achieved an R² of 0.66 with 51 remote sensing factors in XGBoost. Wang et al. [92] used 49 remote sensing factors from Sentinel-2 and DEM data in forest AGB estimation, with an R² of 0.62. This study therefore contributes to improving the accuracy of forest AGB estimation using multi-source remote sensing.

4.3. Limitation and Future Research

The VIF-Lasso variable selection method has some advantages at the technical level. VIF identified the remote sensing factors with multicollinearity, but it mainly considers the linear relationship between forest AGB and remote sensing features. The Lasso-GA method, in turn, can capture the nonlinear relationship between forest AGB and remote sensing factors, but the genetic algorithm was difficult to control: parameter settings such as the population size, number of iterations, crossover rate, and mutation rate of the GA have an important impact on the performance of the algorithm. The traditional variable selection methods, by contrast, are computationally sophisticated and highly interpretable. Therefore, VIF-Lasso is more advantageous relative to the traditional variable selection methods, while Lasso-GA still needs to be refined in variable selection. Although the accuracy of forest AGB estimates has been improved by optimizing the Lasso variable selection method, forest AGB estimates are still affected by many factors [19], including soil properties, topography, temperature, and other environmental variables. The accuracy of forest AGB estimates can therefore be further improved by adding environmental factors [93,94]. The focus of this study is on Lasso optimization and tuning, but there are many other ways to eliminate irrelevant and redundant features in multiple datasets. Zhang et al. [95] proposed a series of methods to select the optimal feature domains to improve land cover classification in complex urbanized coastal areas, including two decision tree models and five Random Forest-based variable importance metrics. Tian et al. [96] developed an ML-SFFS methodology for detecting infected leaves at various stages.
Machine learning algorithms can handle complex nonlinear relationships and are more robust to noise and outliers than traditional fitting models, automatically extracting relevant features to improve the accuracy of model fitting [97]. Therefore, future research should explore and apply advanced variable selection methods and combine various machine learning models. This will improve the accuracy and reliability of forest AGB estimation and provide more scientific decision support for forest resource management and ecosystem protection. Additionally, this study utilized multi-source remote sensing data, which enhanced the model’s robustness in capturing complex ecosystem changes. We did not validate the model in other areas, and it may need to be adjusted according to the specific conditions of different ecosystems [98,99]. With appropriate validation and regional adaptability analysis, these models could still be applicable to other research areas.

5. Conclusions

This study aimed to explore the selection performance of the VIF-Lasso and Lasso-GA variable selection methods in order to improve the accuracy of forest AGB estimation. Forest AGB in Wuyi Village was inverted by integrating five remote sensing data sources and constructing eight machine learning algorithms to evaluate six variable selection methods. The results are as follows:

Optical remote sensing is the most important data source for forest AGB estimation, followed by LiDAR and then MR. In the overall variable selection results, optical remote sensing factors account for 66%, LiDAR for 20%, and MR for 14%.
Variable selection based on Lasso optimization yielded a better R². VIF-Lasso achieved the best model with an R² of 0.75 and an RMSE of 16.48 Mg/ha, while Lasso-GA achieved its best model with an R² of 0.73 and an RMSE of 16.78 Mg/ha. The best pairings of variable selection method and machine learning model are RFIS-BRNN, Boruta-SVM, (VIF-Lasso)-BRNN, Lasso-SGBoost, (Lasso-GA)-ETR, and PC-BRNN.
The ranking of machine learning models by fitting ability is as follows: BRNN > SGBoost > ETR > XGBoost > KNN > EN > RF > SVM, with optimal R² values of 0.75, 0.74, 0.73, 0.69, 0.70, 0.72, 0.69, and 0.65, and RMSE values of 16.48 Mg/ha, 16.94 Mg/ha, 16.78 Mg/ha, 18.59 Mg/ha, 18.18 Mg/ha, 17.63 Mg/ha, 18.51 Mg/ha, and 19.64 Mg/ha, respectively.
The accuracy of forest AGB inversions is greatly influenced by the choice of variables. The selected remote sensing factors have a greater impact on the inversion results than the choice of machine learning model.

In summary, using appropriate variable selection methods together with multi-source remote sensing data is an important means of improving forest AGB estimation accuracy. In addition, the key to improving the accuracy of forest AGB estimates is the appropriate combination of variable selection methods and estimation models. The optimized Lasso variable selection methods proposed in this study can effectively improve the estimation accuracy of forest AGB and provide a good reference for estimating forest AGB using multi-source remote sensing data.

Author Contributions

E.W.: Conceptualization, Methodology, Software, Validation, Investigation, Formal analysis, Writing – original draft, Writing – review and editing. T.H.: Investigation, Software, Writing – review and editing. Z.L.: Methodology, Writing – review and editing. L.B.: Investigation, Data curation. B.G.: Investigation, Data curation. Z.Y.: Investigation, Data curation. Z.F.: Investigation, Data curation. H.L.: Supervision, Validation. G.O.: Resources, Supervision, Project administration, Funding acquisition, Writing – review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was jointly supported by the Science and Technology Program of Yunnan Provincial Science and Technology Department, China (No. 202303AC100009), and the Education Talent of Xingdian Talent Support Program of Yunnan Province, China (No. XDYC-JYRC-2023-0083).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Bonan, G.B. Forests and climate change: Forcings, feedbacks, and the climate benefits of forests. Science 2008, 320, 1444–1449.
Ramoelo, A.; Cho, M.A.; Mathieu, R.; Madonsela, S.; Van De Kerchove, R.; Kaszta, Z.; Wolff, E. Monitoring grass nutrients and biomass as indicators of rangeland quality and quantity using random forest modelling and WorldView-2 data. Int. J. Appl. Earth Obs. Geoinf. 2015, 43, 43–54.
Návar, J. Measurement and assessment methods of forest aboveground biomass: A literature review and the challenges ahead. In Biomass; IntechOpen: London, UK, 2010; pp. 27–64.
Lu, D.; Chen, Q.; Wang, G.; Liu, L.; Li, G.; Moran, E. A survey of remote sensing-based aboveground biomass estimation methods in forest ecosystems. Int. J. Digit. Earth 2016, 9, 63–105.
Koch, B. Status and future of laser scanning, synthetic aperture radar and hyperspectral remote sensing data for forest biomass assessment. ISPRS J. Photogramm. Remote Sens. 2010, 65, 581–590.
Lu, D. Aboveground biomass estimation using Landsat TM data in the Brazilian Amazon. Int. J. Remote Sens. 2005, 26, 2509–2525.
Sarker, M.L.R.; Nichol, J.; Ahmad, B.; Busu, I.; Rahman, A.A. Potential of texture measurements of two-date dual polarization PALSAR data for the improvement of forest biomass estimation. ISPRS J. Photogramm. Remote Sens. 2012, 69, 146–166.
Laurin, G.V.; Liesenberg, V.; Chen, Q.; Guerriero, L.; Del Frate, F.; Bartolini, A.; Coomes, D.; Wilebore, B.; Lindsell, J.; Valentini, R. Optical and SAR sensor synergies for forest and land cover mapping in a tropical site in West Africa. Int. J. Appl. Earth Obs. Geoinf. 2013, 21, 7–16.
Indirabai, I.; Nilsson, M. Estimation of above ground biomass in tropical heterogeneous forests in India using GEDI. Ecol. Inform. 2024, 82, 102712.
Padalia, H.; Prakash, A.; Watham, T. Modelling aboveground biomass of a multistage managed forest through synergistic use of Landsat-OLI, ALOS-2 L-band SAR and GEDI metrics. Ecol. Inform. 2023, 77, 102234.
Silva, C.A.; Duncanson, L.; Hancock, S.; Neuenschwander, A.; Thomas, N.; Hofton, M.; Fatoyinbo, L.; Simard, M.; Marshak, C.Z.; Armston, J. Fusing simulated GEDI, ICESat-2 and NISAR data for regional aboveground biomass mapping. Remote Sens. Environ. 2021, 253, 112234.
Xu, L.; Shu, Q.; Fu, H.; Zhou, W.; Luo, S.; Gao, Y.; Yu, J.; Guo, C.; Yang, Z.; Xiao, J. Estimation of Quercus biomass in Shangri-La based on GEDI spaceborne LiDAR data. Forests 2023, 14, 876.
Shendryk, Y. Fusing GEDI with earth observation data for large area aboveground biomass mapping. Int. J. Appl. Earth Obs. Geoinf. 2022, 115, 103108.
Tian, L.; Wu, X.; Tao, Y.; Li, M.; Qian, C.; Liao, L.; Fu, W. Review of remote sensing-based methods for forest aboveground biomass estimation: Progress, challenges, and prospects. Forests 2023, 14, 1086.
Li, H.; Kato, T.; Hayashi, M.; Wu, L. Estimation of forest aboveground biomass of two major conifers in Ibaraki Prefecture, Japan, from palsar-2 and sentinel-2 data. Remote Sens. 2022, 14, 468.
Englhart, S.; Keuck, V.; Siegert, F. Aboveground biomass retrieval in tropical forests–The potential of combined X-and L-band SAR data use. Remote Sens. Environ. 2011, 115, 1260–1271.
Li, X.; Zhang, M.; Long, J.; Lin, H. A novel method for estimating spatial distribution of forest above-ground biomass based on multispectral fusion data and ensemble learning algorithm. Remote Sens. 2021, 13, 3910.
Tao, Z.; Yi, L.; Bao, A.; Xu, W.; Wang, Z.; Xiong, S.; Bing, H. UAV or satellites? How to find the balance between efficiency and accuracy in above ground biomass estimation of artificial young coniferous forest? Int. J. Appl. Earth Obs. Geoinf. 2024, 134, 104173.
Vafaei, S.; Soosani, J.; Adeli, K.; Fadaei, H.; Naghavi, H.; Pham, T.D.; Tien Bui, D. Improving accuracy estimation of Forest Aboveground Biomass based on incorporation of ALOS-2 PALSAR-2 and Sentinel-2A imagery and machine learning: A case study of the Hyrcanian forest area (Iran). Remote Sens. 2018, 10, 172.
David, R.M.; Rosser, N.J.; Donoghue, D.N. Improving above ground biomass estimates of Southern Africa dryland forests by combining Sentinel-1 SAR and Sentinel-2 multispectral imagery. Remote Sens. Environ. 2022, 282, 113232.
Zhao, P.; Lu, D.; Wang, G.; Liu, L.; Li, D.; Zhu, J.; Yu, S. Forest aboveground biomass estimation in Zhejiang Province using the integration of Landsat TM and ALOS PALSAR data. Int. J. Appl. Earth Obs. Geoinf. 2016, 53, 1–15.
Chen, L.; Ren, C.; Bao, G.; Zhang, B.; Wang, Z.; Liu, M.; Man, W.; Liu, J. Improved object-based estimation of forest aboveground biomass by integrating LiDAR data from GEDI and ICESat-2 with multi-sensor images in a heterogeneous mountainous region. Remote Sens. 2022, 14, 2743.
Tamiminia, H.; Salehi, B.; Mahdianpari, M.; Goulden, T. State-wide forest canopy height and aboveground biomass map for New York with 10 m resolution, integrating GEDI, Sentinel-1, and Sentinel-2 data. Ecol. Inform. 2024, 79, 102404.
Wang, Y.; Zhang, X.; Guo, Z. Estimation of tree height and aboveground biomass of coniferous forests in North China using stereo ZY-3, multispectral Sentinel-2, and DEM data. Ecol. Indic. 2021, 126, 107645.
Cao, C.; Wang, T.; Gao, M.; Li, Y.; Li, D.; Zhang, H. Hyperspectral inversion of nitrogen content in maize leaves based on different dimensionality reduction algorithms. Comput. Electron. Agric. 2021, 190, 106461.
Bhadra, A.; Datta, J.; Polson, N.G.; Willard, B. Lasso meets horseshoe: A Survey. Stat. Sci. 2019, 34, 405–427.
Zandler, H.; Brenning, A.; Samimi, C. Quantifying dwarf shrub biomass in an arid environment: Comparing empirical methods in a high dimensional setting. Remote Sens. Environ. 2015, 158, 140–155.
Lazaridis, D.C.; Verbesselt, J.; Robinson, A.P. Penalized regression techniques for prediction: A case study for predicting tree mortality using remotely sensed vegetation indices. Can. J. For. Res. 2011, 41, 24–34.
Zhang, Y.; Ma, J.; Liang, S.; Li, X.; Liu, J. A stacking ensemble algorithm for improving the biases of forest aboveground biomass estimations from multiple remotely sensed datasets. GIScience Remote Sens. 2022, 59, 234–249.
Shafiee, S.; Lied, L.M.; Burud, I.; Dieseth, J.A.; Alsheikh, M.; Lillemo, M. Sequential forward selection and support vector regression in comparison to LASSO regression for spring wheat yield prediction based on UAV imagery. Comput. Electron. Agric. 2021, 183, 106036.
Signorino, C.S.; Kirchner, A. Using LASSO to model interactions and nonlinearities in survey data. Surv. Pract. 2018, 11, 1–10.
Jiang, F.; Kutia, M.; Ma, K.; Chen, S.; Long, J.; Sun, H. Estimating the aboveground biomass of coniferous forest in Northeast China using spectral variables, land surface temperature and soil moisture. Sci. Total Environ. 2021, 785, 147335.
Li, Y.; Li, M.; Liu, Z.; Li, C. Combining kriging interpolation to improve the accuracy of forest aboveground biomass estimation using remote sensing data. IEEE Access 2020, 8, 128124–128139.
Llobet, E.; Brezmes, J.; Gualdrón, O.; Vilanova, X.; Correig, X. Building parsimonious fuzzy ARTMAP models by variable selection with a cascaded genetic algorithm: Application to multisensor systems for gas analysis. Sens. Actuators B Chem. 2004, 99, 267–272.
Liu, H.-H.; Ong, C.-S. Variable selection in clustering for marketing segmentation using genetic algorithms. Expert Syst. Appl. 2008, 34, 502–510.
Jin, H.; Zhao, Y.; Pak, U.; Zhen, Z.; So, K. Assessing the effect of ensemble learning algorithms and validation approach on estimating forest aboveground biomass: A case study of natural secondary forest in Northeast China. Geo-Spat. Inf. Sci. 2024, 561, 1–20.
Yan, X.; Li, J.; Smith, A.R.; Yang, D.; Ma, T.; Su, Y.; Shao, J. Evaluation of machine learning methods and multi-source remote sensing data combinations to construct forest above-ground biomass models. Int. J. Digit. Earth 2023, 16, 4471–4491.
Huang, T.; Ou, G.; Wu, Y.; Zhang, X.; Liu, Z.; Xu, H.; Xu, X.; Wang, Z.; Xu, C. Estimating the Aboveground Biomass of Various Forest Types with High Heterogeneity at the Provincial Scale Based on Multi-Source Data. Remote Sens. 2023, 15, 3550.
Zhang, Y.; Ma, J.; Liang, S.; Li, X.; Li, M. An evaluation of eight machine learning regression algorithms for forest aboveground biomass estimation from multiple satellite data products. Remote Sens. 2020, 12, 4015.
Hans, C. Elastic net regression modeling with the orthant normal prior. J. Am. Stat. Assoc. 2011, 106, 1383–1393.
Singh, C.; Karan, S.K.; Sardar, P.; Samadder, S.R. Remote sensing-based biomass estimation of dry deciduous tropical forest using machine learning and ensemble analysis. J. Environ. Manag. 2022, 308, 114639.
Jia, Z.; Zhang, Z.; Cheng, Y.; Borjigin, S.; Quan, Z. Grassland biomass spatiotemporal patterns and response to climate change in eastern Inner Mongolia based on XGBoost model estimates. Ecol. Indic. 2024, 158, 111554.
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
Verrelst, J.; Rivera, J.P.; Veroustraete, F.; Muñoz-Marí, J.; Clevers, J.G.; Camps-Valls, G.; Moreno, J. Experimental Sentinel-2 LAI estimation using parametric, non-parametric and physical retrieval methods–A comparison. ISPRS J. Photogramm. Remote Sens. 2015, 108, 260–272.
Tian, Y.; Huang, H.; Zhou, G.; Zhang, Q.; Tao, J.; Zhang, Y.; Lin, J. Aboveground mangrove biomass estimation in Beibu Gulf using machine learning and UAV remote sensing. Sci. Total Environ. 2021, 781, 146816.
Yang, Y.; Sun, Y.; Yang, Y. Analysis of Spatial Accessibility for Rural School Redistricting in West China: A Case Study of the Primary Schools in Zhenyuan County, Yunnan Province. In Proceedings of the 2017 4th International Conference on Information Science and Control Engineering (ICISCE), Changsha, China, 21–23 July 2017; pp. 193–197. [Google Scholar]
You, G.; Zhang, Y.; Liu, Y.; Schaefer, D.; Gong, H.; Gao, J.; Lu, Z.; Song, Q.; Zhao, J.; Wu, C. Investigation of temperature and aridity at different elevations of Mt. Ailao, SW China. Int. J. Biometeorol. 2013, 57, 487–492. [Google Scholar] [CrossRef] [PubMed]
Huang, C.; Wu, C.; Gong, H.; You, G.; Sha, L.; Lu, H. Decomposition of roots of different diameters in response to different drought periods in a subtropical evergreen broad-leaf forest in Ailao Mountain. Glob. Ecol. Conserv. 2020, 24, e01236. [Google Scholar] [CrossRef]
Young, S.S.; Carpenter, C.; Zhi-Jun, W. A study of the structure and composition of an old growth and secondary broad-leaved forest in the Ailao Mountains of Yunnan, China. Mt. Res. Dev. 1992, 12, 269–284. [Google Scholar] [CrossRef]
Xu, H.; Zhang, Z.; Ou, G. (Eds.) A Study on Estimation and Distribution for Forest Biomass and Carbon Storage in Yunnan Province; Yunnan Science and Technology Press: Kunming, China, 2019. [Google Scholar]
Huang, T.; Ou, G.; Xu, H.; Zhang, X.; Wu, Y.; Liu, Z.; Zou, F.; Zhang, C.; Xu, C. Comparing Algorithms for Estimation of Aboveground Biomass in Pinus yunnanensis. Forests 2023, 14, 1742. [Google Scholar] [CrossRef]
Yan, Z. Estimation of Forest Biomass in Beijing Based on Landsat 8 OLI and ALOS-2 PALSAR-2 Data. Master’s Thesis, Beijing Forestry University, Beijing, China, 2020. [Google Scholar]
Shen, W.; Li, M.; Huang, C.; Tao, X.; Wei, A. Annual forest aboveground biomass changes mapped using ICESat/GLAS measurements, historical inventory data, and time-series optical and radar imagery for Guangdong province, China. Agric. For. Meteorol. 2018, 259, 23–38. [Google Scholar] [CrossRef]
Iranzad, R.; Liu, X. A review of random forest-based feature selection methods for data science education and applications. Int. J. Data Sci. Anal. 2024, 18, 1–15. [Google Scholar] [CrossRef]
Corrales, D.C.; Schoving, C.; Raynal, H.; Debaeke, P.; Journet, E.-P.; Constantin, J. A surrogate model based on feature selection techniques and regression learners to improve soybean yield prediction in southern France. Comput. Electron. Agric. 2022, 192, 106578. [Google Scholar] [CrossRef]
Torre-Tojal, L.; Bastarrika, A.; Boyano, A.; Lopez-Guede, J.M.; Grana, M. Above-ground biomass estimation from LiDAR data using random forest algorithms. J. Comput. Sci. 2022, 58, 101517. [Google Scholar] [CrossRef]
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Chen, L.; Wang, Y.; Ren, C.; Zhang, B.; Wang, Z. Optimal combination of predictors and algorithms for forest above-ground biomass mapping from Sentinel and SRTM data. Remote Sens. 2019, 11, 414. [Google Scholar] [CrossRef]
Ploton, P.; Barbier, N.; Couteron, P.; Antin, C.; Ayyappan, N.; Balachandran, N.; Barathan, N.; Bastin, J.-F.; Chuyong, G.; Dauby, G. Toward a general tropical forest biomass prediction model from very high resolution optical satellite images. Remote Sens. Environ. 2017, 200, 140–153. [Google Scholar] [CrossRef]
Yuan, Y.; Wang, X. Performance comparison of RGB and multispectral vegetation indices based on machine learning for estimating Hopea hainanensis SPAD values under different shade conditions. Front. Plant Sci. 2022, 13, 928953. [Google Scholar] [CrossRef] [PubMed]
Du, C.; Sun, L.; Bai, H.; Liu, Y.; Yang, J.; Wang, X. Quantitative detection of azodicarbonamide in wheat flour by near-infrared spectroscopy based on two-step feature selection. Chemom. Intell. Lab. Syst. 2021, 219, 104445. [Google Scholar] [CrossRef]
Chiarito, E.; Cigna, F.; Cuozzo, G.; Fontanelli, G.; Mejia Aguilar, A.; Paloscia, S.; Rossi, M.; Santi, E.; Tapete, D.; Notarnicola, C. Biomass retrieval based on genetic algorithm feature selection and support vector regression in Alpine grassland using ground-based hyperspectral and Sentinel-1 SAR data. Eur. J. Remote Sens. 2021, 54, 209–225. [Google Scholar] [CrossRef]
Fernández-Manso, O.; Fernández-Manso, A.; Quintano, C. Estimation of aboveground biomass in Mediterranean forests by statistical modelling of ASTER fraction images. Int. J. Appl. Earth Obs. Geoinf. 2014, 31, 45–56. [Google Scholar] [CrossRef]
Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1. [Google Scholar] [CrossRef] [PubMed]
Huang, C.-L.; Wang, C.-J. A GA-based feature selection and parameters optimization for support vector machines. Expert Syst. Appl. 2006, 31, 231–240. [Google Scholar] [CrossRef]
Gong, H.; Li, Y.; Zhang, J.; Zhang, B.; Wang, X. A new filter feature selection algorithm for classification task by ensembling pearson correlation coefficient and mutual information. Eng. Appl. Artif. Intell. 2024, 131, 107865. [Google Scholar] [CrossRef]
Degenhardt, F.; Seifert, S.; Szymczak, S. Evaluation of variable selection methods for random forests and omics data sets. Brief. Bioinform. 2019, 20, 492–503. [Google Scholar] [CrossRef] [PubMed]
Speiser, J.L.; Miller, M.E.; Tooze, J.; Ip, E. A comparison of random forest variable selection methods for classification prediction modeling. Expert Syst. Appl. 2019, 134, 93–101. [Google Scholar] [CrossRef]
Fassnacht, F.; Hartig, F.; Latifi, H.; Berger, C.; Hernández, J.; Corvalán, P.; Koch, B. Importance of sample size, data type and prediction method for remote sensing-based estimations of aboveground forest biomass. Remote Sens. Environ. 2014, 154, 102–114. [Google Scholar] [CrossRef]
Gleason, C.J.; Im, J. Forest biomass estimation from airborne LiDAR data using machine learning approaches. Remote Sens. Environ. 2012, 125, 80–91. [Google Scholar] [CrossRef]
Halme, E.; Pellikka, P.; Mottus, M. Utility of hyperspectral compared to multispectral remote sensing data in estimating forest biomass and structure variables in Finnish boreal forest. Int. J. Appl. Earth Obs. Geoinf. 2019, 83, 101942. [Google Scholar] [CrossRef]
Mizumoto, A. Calculating the relative importance of multiple regression predictor variables using dominance analysis and random forests. Lang. Learn. 2023, 73, 161–196. [Google Scholar] [CrossRef]
Dube, T.; Mutanga, O. Investigating the robustness of the new Landsat-8 Operational Land Imager derived texture metrics in estimating plantation forest aboveground biomass in resource constrained areas. ISPRS J. Photogramm. Remote Sens. 2015, 108, 12–32. [Google Scholar] [CrossRef]
Iban, M.C.; Sekertekin, A. Machine learning based wildfire susceptibility mapping using remotely sensed fire data and GIS: A case study of Adana and Mersin provinces, Turkey. Ecol. Inform. 2022, 69, 101647. [Google Scholar] [CrossRef]
Van der Meer, F.D.; Jia, X. Collinearity and orthogonality of endmembers in linear spectral unmixing. Int. J. Appl. Earth Obs. Geoinf. 2012, 18, 491–503. [Google Scholar] [CrossRef]
Arjasakusuma, S.; Swahyu Kusuma, S.; Phinn, S. Evaluating variable selection and machine learning algorithms for estimating forest heights by combining lidar and hyperspectral data. ISPRS Int. J. Geo-Inf. 2020, 9, 507. [Google Scholar] [CrossRef]
Örkcü, H.H. Subset selection in multiple linear regression models: A hybrid of genetic and simulated annealing algorithms. Appl. Math. Comput. 2013, 219, 11018–11028. [Google Scholar] [CrossRef]
Ji, Y.; Xu, K.; Zeng, P.; Zhang, W. GA-SVR algorithm for improving forest above ground biomass estimation using SAR data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 6585–6595. [Google Scholar] [CrossRef]
Dong, H.; Li, T.; Ding, R.; Sun, J. A novel hybrid genetic algorithm with granular information for feature selection and optimization. Appl. Soft Comput. 2018, 65, 33–46. [Google Scholar] [CrossRef]
Heiskanen, J. Estimating aboveground tree biomass and leaf area index in a mountain birch forest using ASTER satellite data. Int. J. Remote Sens. 2006, 27, 1135–1158. [Google Scholar] [CrossRef]
Silveira, E.M.; Radeloff, V.C.; Martinuzzi, S.; Pastur, G.J.M.; Bono, J.; Politi, N.; Lizarraga, L.; Rivera, L.O.; Ciuffoli, L.; Rosas, Y.M. Nationwide native forest structure maps for Argentina based on forest inventory data, SAR Sentinel-1 and vegetation metrics from Sentinel-2 imagery. Remote Sens. Environ. 2023, 285, 113391. [Google Scholar] [CrossRef]
Li, X.; Liu, Z.; Lin, H.; Wang, G.; Sun, H.; Long, J.; Zhang, M. Estimating the growing stem volume of Chinese pine and larch plantations based on fused optical data using an improved variable screening method and stacking algorithm. Remote Sens. 2020, 12, 871. [Google Scholar] [CrossRef]
Li, Y.; Li, M.; Li, C.; Liu, Z. Forest aboveground biomass estimation using Landsat 8 and Sentinel-1A data with machine learning algorithms. Sci. Rep. 2020, 10, 9952. [Google Scholar] [CrossRef]
Tunca, E.; Köksal, E.S.; Öztürk, E.; Akay, H.; Taner, S.Ç. Accurate leaf area index estimation in sorghum using high-resolution UAV data and machine learning models. Phys. Chem. Earth Parts A/B/C 2024, 133, 103537. [Google Scholar] [CrossRef]
Lujan-Moreno, G.A.; Howard, P.R.; Rojas, O.G.; Montgomery, D.C. Design of experiments and response surface methodology to tune machine learning hyperparameters, with a random forest case-study. Expert Syst. Appl. 2018, 109, 195–205. [Google Scholar] [CrossRef]
Shrestha, A.; Mahmood, A. Review of deep learning algorithms and architectures. IEEE Access 2019, 7, 53040–53065. [Google Scholar] [CrossRef]
Keshari, R.; Ghosh, S.; Chhabra, S.; Vatsa, M.; Singh, R. Unravelling small sample size problems in the deep learning world. In Proceedings of the 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM), New Delhi, India, 24–26 September 2020. [Google Scholar] [CrossRef]
Liu, Q.; Wu, Z.; Cui, N.; Jin, X.; Zhu, S.; Jiang, S.; Zhao, L.; Gong, D. Estimation of Soil Moisture Using Multi-Source Remote Sensing and Machine Learning Algorithms in Farming Land of Northern China. Remote Sens. 2023, 15, 4214. [Google Scholar] [CrossRef]
de Almeida, C.T.; Galvao, L.S.; Ometto, J.P.H.B.; Jacon, A.D.; de Souza Pereira, F.R.; Sato, L.Y.; Lopes, A.P.; de Alencastro Graça, P.M.L.; de Jesus Silva, C.V.; Ferreira-Ferreira, J. Combining LiDAR and hyperspectral data for aboveground biomass modeling in the Brazilian Amazon using different regression algorithms. Remote Sens. Environ. 2019, 232, 111323. [Google Scholar] [CrossRef]
Zhao, Q.; Yu, S.; Zhao, F.; Tian, L.; Zhao, Z. Comparison of machine learning algorithms for forest parameter estimations and application for forest quality assessments. For. Ecol. Manag. 2019, 434, 224–234. [Google Scholar] [CrossRef]
Zhang, Y.; Liu, J.; Li, W.; Liang, S. A proposed ensemble feature selection method for estimating forest aboveground biomass from multiple satellite data. Remote Sens. 2023, 15, 1096. [Google Scholar] [CrossRef]
Wang, P.; Tan, S.; Zhang, G.; Wang, S.; Wu, X. Remote Sensing Estimation of Forest Aboveground Biomass Based on Lasso-SVR. Forests 2022, 13, 1597. [Google Scholar] [CrossRef]
Peng, D.; Zhang, H.; Liu, L.; Huang, W.; Huete, A.R.; Zhang, X.; Wang, F.; Yu, L.; Xie, Q.; Wang, C. Estimating the aboveground biomass for planted forests based on stand age and environmental variables. Remote Sens. 2019, 11, 2270. [Google Scholar] [CrossRef]
Montesano, P.; Cook, B.; Sun, G.; Simard, M.; Nelson, R.; Ranson, K.; Zhang, Z.; Luthcke, S. Achieving accuracy requirements for forest biomass mapping: A spaceborne data fusion method for estimating forest biomass and LiDAR sampling error. Remote Sens. Environ. 2013, 130, 153–170. [Google Scholar] [CrossRef]
Zhang, F.; Yang, X. Improving land cover classification in an urbanized coastal area by random forests: The role of variable selection. Remote Sens. Environ. 2020, 251, 112105. [Google Scholar] [CrossRef]
Tian, L.; Xue, B.; Wang, Z.; Li, D.; Yao, X.; Cao, Q.; Zhu, Y.; Cao, W.; Cheng, T. Spectroscopic detection of rice leaf blast infection from asymptomatic to mild stages with integrated machine learning and feature selection. Remote Sens. Environ. 2021, 257, 112350. [Google Scholar] [CrossRef]
Anees, S.A.; Mehmood, K.; Khan, W.R.; Sajjad, M.; Alahmadi, T.A.; Alharbi, S.A.; Luo, M. Integration of machine learning and remote sensing for above ground biomass estimation through Landsat-9 and field data in temperate forests of the Himalayan region. Ecol. Inform. 2024, 82, 102732. [Google Scholar] [CrossRef]
Zhang, J. Multi-source remote sensing data fusion: Status and trends. Int. J. Image Data Fusion 2010, 1, 5–24. [Google Scholar] [CrossRef]
Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]

Figure 1. Technology roadmap for this study.

Figure 2. The study area and sample plot distribution: (a) the location of Zhenyuan in Yunnan Province; (b) six types of remote sensing imagery; (c) remote sensing image data of Wuyi Village.

Figure 3. Data distribution for the original dataset (60 samples), training set (42 samples), and test set (18 samples).

Figure 4. Results of variable selection: (a) Boruta variable selection results, obtained by comparing the shadow features with the original features; (b) Lasso regularized compression of the eigenvectors obtained from the; (c) variable selection results with GA variable selection re-applied to the Lasso variable selection results; (d) variables whose correlation coefficient with forest AGB is greater than 0.5; (e) RFIS variable importance values for each remote sensing factor; (f) Lasso variable selection results after removing multicollinear remote sensing factors using VIF.

Figure 5. Scatterplots of forest AGB model fits on the test set for 8 algorithms and 6 variable selection methods.

Figure 6. Test-set R² of the 8 machine learning algorithms under each of the 6 variable selection methods.

Figure 7. AGB inversion plot using 8 algorithms with 6 types of variable selection.

Table 1. The statistical parameters of Pinus kesiya var. langbianensis sample plot datasets.

Variables	Minimum	Mean	Maximum	STD
H (m)	3.20	10.90	16.90	2.70
Dg (cm)	5.20	9.80	45.40	1.15
AGB (Mg/ha)	63.78	122.10	207.40	30.89

Table 2. Information on remotely sensed data.

Image	Image ID	Cloud Cover (%)	Source	Access Date
Landsat 8 OLI	LC08_L1TP_130044_20230407_20230420_02_T1	0.88	https://scihub.copernicus.eu/	15 October 2023
Sentinel 2	S2A_MSIL1C_20230330T034531_N0509_R104_T47QPG_20230330T060328.SAFE	0.53	https://scihub.copernicus.eu/	12 October 2023
GEDI	GEDI02_B_2021002014205_O11653_03_T10693_02_003_01_V002 GEDI02_B_2021033131410_O12141_03_T09270_02_003_01_V002 GEDI02_B_2021158031308_O14072_02_T06143_02_003_01_V002 GEDI02_B_2022011125142_O17457_02_T07566_02_003_01_V002 GEDI02_B_2022044083839_O17966_03_T10693_02_003_01_V002	-	https://earthexplorer.usgs.gov/	8 October 2023
Sentinel 1	S1A_IW_GRDH_1SDV_20230203T112327_20230203T112352_05A589_185D.SAFE	-	https://search.asf.alaska.edu	15 September 2023
ALOS-2 PALSAR-2	0000519755_001001_ALOS2495483130-230727	-	https://www.earthdata.nasa.gov/	25 October 2023
DEM	ASTGTMV003_N24E101	0.61	http://www.gscloud.cn/	9 July 2023

Table 3. The image source and characteristic factors of remote sensing.

Image	Index	Abbreviation
Landsat 8 OLI	band1-coastal aerosol, band2-blue (BLU), band3-green (GRN), band4-red (RED), band5-near-infrared (NIR), band6-shortwave infrared 1 (SWIR1), and band7-shortwave infrared 2 (SWIR2)	B1, B2, B3, B4, B5, B6, B7
	normalized difference vegetation index	NDVI
	NDVI with band3 and band4	ND43
	NDVI with band6 and band7	ND67
	NDVI with band3 and band5 with band6	ND563
	difference vegetation index	DVI
	soil-adjusted vegetation index	SAVI
	ratio vegetation index	RVI
	brightness vegetation index	B
	greenness vegetation index	G
	temperature vegetation index	W
	atmospherically resistant vegetation index	ARVI
	mid-infrared temperature vegetation index	MV17
	modified soil-adjusted vegetation index	MSAVI
	multiband linear combination of band2, band3 and band4	VIS234
	multiband linear combination	ALBEDO
	simple ratio index	SR
	improved vegetation index	SAV12
	optimized simple ratio vegetation index	MSR
	karst terrain factor 1	KT1
	principal component 1—factor A	PC1-A
	principal component 1—factor B	PC1-B
	principal component 1—factor P	PC1-P
Sentinel 2	B2-Blue, B3-Green, B4-Red, B5-Vegetation red edge, B6-Vegetation red edge, B7-Vegetation red edge, B8-NIR, B9-Water vapour, B10-SWIR-Cirrus, B11-SWIR	B2, B3, B4, B5, B6, B7, B8, B9, B10
	ratio vegetation index	RVI
	difference vegetation index	DVI
	weighted difference vegetation index	WDVI
	infrared percentage vegetation index	IPVI
	perpendicular vegetation index	PVI
	normalized difference vegetation index	NDVI
	NDVI with band4 and band5	NDVI45
	NDVI of green band	GNDVI
	inverted red edge chlorophyll index	IRECI
	soil adjusted vegetation index	SAVI
	transformed soil-adjusted vegetation index	TSAVI
	modified soil-adjusted vegetation index	MSAVI
	sentinel-2 red edge position index	S2REP
	red edge inflection point index	REIP
	atmospherically resistant vegetation index	ARVI
	pigment-specific simple ratio chlorophyll index	PSSRa
	MERIS terrestrial chlorophyll index	MTCI
	modified chlorophyll absorption ratio index	MCARI
GEDI	Total cover, defined as the percentage of the ground covered by the vertical projection of canopy material	cover
	Estimated Pgap(theta) for the selected L2A algorithm	pgap_theta
	Total plant area index	pai
	Leaf-on day of year	leaf_on_doy
	Total Pgap(theta) error	pgap_theta_error
	Integral of the ground component in the RX waveform	rg_aN
	Received waveform energy between toploc and botloc with noise removed	rx_energy_aN
	Foliage height diversity index calculated from the vertical foliage profile, normalized by total plant area index	fhd_normal
	Quality flag	quality_flag
	Percentage non-vegetated from MODIS data	modis_nonvegetated
	Percentage of tree cover from MODIS data	modis_treecover
	DEM from GED	dem_
	Leaf-off day of year	leaf_off_doy
	Integral of the vegetation component in the RX waveform	rv_aN
	Maximum canopy cover that can be penetrated considering the SNR of the waveform	sensitivity
	Height above ground of the received waveform signal start	rh100
	Latitude and longitude	Lat_lowestmode, Lon_lowestmode
	Degrade flag	degrade_flag
ALOS-2 PALSAR-2	Horizontal Transmit, Horizontal Receive	HH
	Vertical Transmit, Vertical Receive	VV
	Horizontal Transmit, Vertical Receive	HV
	Vertical Transmit, Horizontal Receive	VH
	Dual polarization backscatter coefficient value	Factor
Sentinel 1	Vertical Transmit, Vertical Receive	VV
	Vertical Transmit, Horizontal Receive	VH
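
Table 3 names the spectral indices but does not give their formulas. As a minimal sketch, assuming the widely used textbook definitions (which may differ in detail from the implementation used in this study), four of the listed indices can be computed from red and near-infrared reflectance as follows. The function names, the soil factor of 0.5, the NIR/RED convention for RVI, and the sample reflectance values are illustrative assumptions, not values from the study. For Landsat 8 OLI the red and near-infrared bands are band4 and band5; for Sentinel 2 they are B4 and B8.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - RED) / (NIR + RED)."""
    return (nir - red) / (nir + red)


def dvi(nir, red):
    """Difference vegetation index: NIR - RED."""
    return nir - red


def rvi(nir, red):
    """Ratio vegetation index, here taken as NIR / RED (assumed convention)."""
    return nir / red


def savi(nir, red, soil_factor=0.5):
    """Soil-adjusted vegetation index with soil brightness factor L (assumed 0.5)."""
    return (nir - red) * (1.0 + soil_factor) / (nir + red + soil_factor)


# Hypothetical surface reflectance values for a single vegetated pixel.
nir_value, red_value = 0.40, 0.10

indices = {
    "NDVI": ndvi(nir_value, red_value),
    "DVI": dvi(nir_value, red_value),
    "RVI": rvi(nir_value, red_value),
    "SAVI": savi(nir_value, red_value),
}

for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

The same functions apply per pixel to whole bands once the imagery has been converted to reflectance; the other indices in Table 3 follow the same pattern with different band combinations.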

Table 4. Descriptive Statistics of Training and Test Data.

Data (AGB)	Mean	Median	STD	Minimum	Maximum
Training Set (Mg/ha)	123.00	118.00	29.65	70.02	207.40
Testing Set (Mg/ha)	120.00	122.00	34.56	63.78	182.34
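
Figure 3 and Table 4 describe a split of 60 sample plots into 42 training and 18 test plots, summarized by mean, median, standard deviation, minimum, and maximum. The sketch below shows one way such a 70/30 split and summary could be produced. It assumes a simple random split and the sample standard deviation; this section does not state how the split was drawn or which estimator was used, and the AGB values here are randomly generated placeholders within the range given in Table 1, not the study's plot data.

```python
import random
import statistics


def split_plots(values, train_fraction=0.7, seed=0):
    """Shuffle the plot values reproducibly and split them into train and test lists."""
    shuffled = list(values)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def describe(values):
    """Return the summary statistics reported in Table 4 (sample standard deviation assumed)."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


# Placeholder AGB values (Mg/ha) for 60 plots; NOT the measured data.
placeholder_rng = random.Random(42)
agb = [placeholder_rng.uniform(63.78, 207.40) for _ in range(60)]

train, test = split_plots(agb)

for label, subset in (("Training Set", train), ("Testing Set", test)):
    summary = describe(subset)
    cells = "\t".join(f"{summary[key]:.2f}" for key in ("mean", "median", "std", "min", "max"))
    print(f"{label} (n={len(subset)})\t{cells}")
```

Comparing the two summaries, as Table 4 does, is a quick check that the training and test sets cover a similar AGB range before any model is fitted.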

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

