Open Access Article

Regional Soil Moisture Estimation Leveraging Multi-Source Data Fusion and Automated Machine Learning

Shenglin Li, Pengyuan Zhu, Ni Song, Caixia Li and Jinglei Wang

Farmland Irrigation Research Institute, Chinese Academy of Agricultural Sciences, Xinxiang 453002, China

Author to whom correspondence should be addressed.

Remote Sens. 2025, 17(5), 837; https://doi.org/10.3390/rs17050837

Submission received: 14 December 2024 / Revised: 16 February 2025 / Accepted: 26 February 2025 / Published: 27 February 2025


Figure 1. Study area and distribution of sampling sites. Triangular markers indicate the locations of soil moisture (SM) sampling points. (a) Location of Henan Province, China; (b) Digital Elevation Model (DEM) of Henan Province and the location of the study area in Henan Province; (c) DEM of the study area; (d) Land cover classification map of the study area and the locations of sampling points.
Figure 2. Flowchart showing overall methodology for soil moisture (SM) estimation.
Figure 3. Statistical distribution of the full dataset, training set, and testing set.
Figure 4. Statistical indicators of soil moisture estimation accuracy under six input scenarios, including R, RMSE, and RRMSE.
Figure 5. Box plot illustrating the error distribution of the three AutoML algorithms under different scenarios.
Figure 6. Scatter plot of the prediction results from three AutoML algorithms using SC6 (MS + TIR + auxiliary) as the input variables.
Figure 7. Spatial and temporal distribution maps of soil moisture (SM).
Figure 8. Distribution maps of soil moisture (SM) estimation using AutoGluon, TPOT, and H2O AutoML for 21 March 2015 and 3 April 2015. The first column represents 21 March 2015, and the second column represents 3 April 2015.


Abstract

Soil moisture (SM) monitoring in farmland at a regional scale is crucial for precision irrigation management and ensuring food security. However, existing methods for SM estimation encounter significant challenges related to accuracy, generalizability, and automation. This study proposes an integrated data fusion method to systematically assess the potential of three automated machine learning (AutoML) frameworks—tree-based pipeline optimization tool (TPOT), AutoGluon, and H2O AutoML—in retrieving SM. To evaluate the impact of input variables on estimation accuracy, six input scenarios were designed: multispectral data (MS), thermal infrared data (TIR), MS combined with TIR, MS with auxiliary data, TIR with auxiliary data, and a comprehensive combination of MS, TIR, and auxiliary data. The research was conducted in a winter wheat cultivation area within the People’s Victory Canal Irrigation Area, focusing on the 0–40 cm soil layer. The results revealed that the scenario incorporating all data types (MS + TIR + auxiliary) achieved the highest retrieval accuracy. Under this scenario, all three AutoML frameworks demonstrated optimal performance. AutoGluon demonstrated superior performance in most scenarios, particularly excelling in the MS + TIR + auxiliary data scenario. It achieved the highest retrieval accuracy with a Pearson correlation coefficient (R) value of 0.822, root mean square error (RMSE) of 0.038 cm³/cm³, and relative root mean square error (RRMSE) of 16.46%. This study underscores the critical role of input data types and fusion strategies in enhancing SM estimation accuracy and highlights the significant advantages of AutoML frameworks for regional-scale SM retrieval. The findings offer a robust technical foundation and theoretical guidance for advancing precision irrigation management and efficient SM monitoring.

Keywords: soil moisture; multi-source remote sensing; automated machine learning; data fusion

1. Introduction

Soil moisture (SM) refers to the proportion of water within unsaturated soil, typically measured either by volume or weight [1,2]. It plays a critical role in the water cycle, energy balance, and ecosystem functioning [3,4,5] and serves as a key parameter for drought monitoring, irrigation management, and crop yield prediction [6,7,8].

Currently, SM monitoring methods primarily include in situ ground observations, data assimilation, and remote sensing techniques. While in situ observations offer high accuracy, they are unable to capture the spatial heterogeneity of regional SM and are costly to implement. Data assimilation relies on numerous parameters that are often difficult to obtain and are subject to uncertainties. Although systems such as the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5-Land [9], the Global Land Data Assimilation System (GLDAS) [10], and the China Meteorological Administration Land Data Assimilation System (CLDAS) [11] have been developed, their coarse spatial resolution and uncertainties in accuracy limit their applicability at the farmland scale. In contrast, remote sensing provides an effective means to rapidly obtain SM information over large areas. Microwave remote sensing estimates SM by utilizing soil dielectric properties or radar backscattering characteristics and has produced several global SM products, such as the Soil Moisture Active Passive (SMAP), Advanced SCATterometer (ASCAT), Advanced Microwave Scanning Radiometer 2 (AMSR2), and Soil Moisture and Ocean Salinity (SMOS) [12,13,14,15]. However, these products have low spatial resolutions (25–50 km), which are inadequate for farmland-scale monitoring [16]. Synthetic aperture radar (SAR)-based data, such as Sentinel-1, offer higher spatial resolution but are sensitive to vegetation coverage, resulting in poor estimation accuracy in densely vegetated areas [17]. Therefore, the development of a rapid and accurate method for SM retrieval remains an urgent challenge. Optical remote sensing, with its fine spatial resolution, has emerged as a promising alternative [18,19]. It indirectly reflects drought conditions and SM content by analyzing crop spectral characteristics, surface temperature, and canopy structure variations. For instance, Schnur et al. developed a method for estimating root zone SM using the normalized difference vegetation index (NDVI) [20,21]. Similarly, Han et al. successfully estimated farmland SM using moderate-resolution imaging spectroradiometer (MODIS) surface reflectance data and the enhanced vegetation index (EVI) [22,23]. However, relying solely on the NDVI or EVI may not effectively capture SM information, as factors such as land cover changes and pest infestations can cause anomalies in these indices, resembling those induced by drought. To more accurately reflect SM status, additional remote sensing indices have been proposed, including the shortwave infrared water stress index (SIWSI) [24], the vegetation supply water index (VSWI) [25], and the temperature vegetation dryness index (TVDI) [26].

The relationship between SM and remote sensing indices is complex and nonlinear [27]. Machine learning, with its exceptional capability to address nonlinear problems, has emerged as a promising tool for SM estimation [28,29]. Cheng et al. utilized Landsat 8 data and the random forest (RF) algorithm to establish relationships between indices such as the NDVI and TVDI and in situ SM measurements, enabling regional SM retrieval [30]. Similarly, Zhang et al. compared RF and extreme gradient boosting (XGBoost), employing Landsat 8, SMAP, the ERA5-Land reanalysis dataset, and auxiliary data, such as soil texture and topography, to generate 30 m resolution surface SM [31]. Given the limitations of individual models, ensemble learning methods have been developed to leverage the strengths of multiple models. Studies have reported that stacking algorithms, which integrate multiple models to reduce bias, often outperform any single machine learning model in terms of prediction accuracy and stability [32,33,34]. For example, Tao et al. applied tree-based algorithms such as categorical boosting (CatBoost), RF, and gradient boosting decision tree (GBDT) to develop a stacking ensemble model for SM retrieval in vineyards. Their results demonstrated that the stacking ensemble model exhibited higher accuracy and stability compared to any single algorithm [35]. Das et al. employed a stacking ensemble integrating RF, gradient boosting machines (GBMs), and cubist, effectively improving SM retrieval performance over individual models [36]. Most existing studies typically employ a single machine learning algorithm or a limited set of algorithms for modeling, failing to fully leverage the potential of the vast array of algorithms available in machine learning libraries. As a result, these studies often struggle to fully exploit the advantages of machine learning methods. 
Furthermore, existing research relies on manual algorithm selection and hyperparameter tuning, which not only makes it difficult to confirm the optimality of the chosen algorithms, but also results in time-consuming and complex hyperparameter optimization processes. These processes are often overlooked or hindered by researchers’ personal preferences, preventing the identification of optimal configurations and, consequently, limiting model performance improvements [37]. These limitations highlight significant bottlenecks in current SM inversion methods with regard to algorithm selection, model tuning, and accuracy enhancement. Therefore, there is an urgent need to develop an automated workflow capable of automatically selecting the best algorithms from a wide range of machine learning libraries and tuning hyperparameters to achieve efficient and accurate SM estimation. With the rapid development of automated machine learning (AutoML) technologies, addressing these issues has become more feasible. AutoML platforms typically integrate a wide range of commonly used machine learning algorithms, enabling automated algorithm selection, hyperparameter tuning, and model integration, thereby significantly improving model performance and efficiency. AutoML platforms, such as AutoGluon [38], tree-based pipeline optimization tool (TPOT) [39], Auto-Keras [40], and H2O AutoML [41], reduce manual intervention by automatically selecting algorithms and optimizing hyperparameters, enabling efficient model deployment. In recent years, AutoML has demonstrated significant potential across various fields. For instance, Xu et al. employed multiple AutoML frameworks to model and reveal the complex effects of microplastics on methane production [42]. Sun et al. utilized H2O AutoML to reconstruct GRACE data, bridging the one-year gap between GRACE missions and providing a vital foundation for large-scale water storage studies [37]. Zhang et al. developed an automated machine learning-assisted ensemble framework (AutoML-Ens) using H2O AutoML, improving global farmland evapotranspiration estimation [43]. Although some progress has been made, research on the application of AutoML in SM inversion remains limited, especially in terms of systematically comparing different AutoML algorithms and analyzing the impact of input variable combinations on SM estimation performance. Further investigations into the advantages and limitations of different algorithms and input configurations are necessary to advance more efficient and accurate SM estimation methods.

In this study, we selected the People’s Victory Canal Irrigation Area as the study region. This area is characterized by high agricultural water demand and relatively scarce water resources, making it a typical representative of agricultural environments in northern China. The irrigation methods in this region include canal irrigation and surface irrigation, covering a variety of soil types, which leads to significant variations in SM at different spatial and temporal scales [44]. Moreover, the spatial heterogeneity of SM is particularly pronounced in this region due to the complex interactions between irrigation practices, soil types, and crop species. These factors make SM variation in this region particularly challenging, thus providing an ideal setting for testing and improving SM estimation methods. These challenges further underscore the importance of developing accurate and reliable SM inversion techniques.

To address the aforementioned challenges, we utilized a total of ten features for SM retrieval, including multispectral remote sensing indices (NDVI, EVI, and SIWSI), thermal infrared remote sensing indices (land surface temperature (LST), TVDI, and VSWI), and auxiliary data including the digital elevation model (DEM) and soil properties (sand%, silt%, and clay%). Three AutoML models, namely TPOT, AutoGluon, and H2O AutoML, were employed to analyze their predictive performance in SM estimation. This study aims to address the following key questions: (1) How do TPOT, AutoGluon, and H2O AutoML compare in terms of SM retrieval accuracy? (2) What is the impact of different input feature combinations (MS, TIR, auxiliary data) on SM estimation performance? (3) Can AutoML effectively capture the spatiotemporal dynamics of SM in an agricultural setting?

2. Materials and Methods

2.1. Study Area

The People’s Victory Canal Irrigation Area is located in Henan Province, China, with geographic coordinates ranging from 113°30′E to 114°22′E and 34°58′N to 35°29′N (Figure 1). The region features flat terrain, primarily comprising an alluvial plain, and is characterized by a semi-arid to semi-humid climate. The annual average temperature ranges from 14 °C to 16 °C, and annual precipitation varies between 600 and 800 mm. The area experiences distinct seasons, with hot, rainfall-concentrated summers and cold, dry winters. The main crops cultivated are winter wheat (October to May) and summer maize (June to September). Irrigation water sources for the area include precipitation, surface water (diverted from the Yellow River), and groundwater [44]. This region is particularly suited for SM research due to the significant seasonal variation in precipitation and temperature, which strongly influence SM dynamics. The semi-arid to semi-humid climate, combined with intensive agricultural practices, creates an urgent need for accurate SM monitoring to optimize irrigation management and ensure crop health. Furthermore, the region benefits from diverse irrigation water sources, providing a unique opportunity to explore the relationship between SM and water resource management strategies.

2.2. Data Collection and Preprocessing

2.2.1. Remote Sensing Data

The remote sensing data utilized in this study comprise the MODIS reflectance product (MOD09GA) with a 500 m resolution and the Thermal and Reanalysis Integrating Moderate-resolution Spatial-seamless (TRIMS) LST, providing 1 km resolution all-weather LST data. MOD09GA is derived from the MODIS sensor on NASA’s Terra satellite, providing atmospherically corrected daily surface reflectance with a spatial resolution of 500 m, covering seven spectral bands from blue to shortwave infrared [45]. Daily MOD09GA data were obtained from NASA’s Level-1 and Atmosphere Archive and Distribution System (LAADS) Distributed Active Archive Center (DAAC) (https://ladsweb.modaps.eosdis.nasa.gov/search/, accessed on 11 October 2024). Before the MOD09GA data were used, pixels affected by clouds, cloud shadows, and other low-quality conditions were identified and removed based on the Quality Assessment (QA) band, so that only high-quality data entered the analysis. The TRIMS LST dataset offers daily all-weather LST data at a 1 km resolution for China and its surrounding regions. This dataset is developed using a novel LST temporal decomposition model, which combines MODIS LST, GLDAS LST, and auxiliary data such as vegetation indices and surface albedo to enhance accuracy and applicability [46,47]. With MODIS LST as a reference, the dataset shows a mean bias error (MBE) of 0.09 K and −0.03 K for daytime and nighttime, respectively, with standard deviations (STDs) of 1.45 K and 1.17 K. Validation against data from 19 ground stations reveals MBE values ranging from −2.26 K to 1.73 K, with root mean square error (RMSE) values ranging from 0.80 K to 3.68 K. No significant differences were found between clear-sky and non-clear-sky conditions. These results indicate that the dataset demonstrates high accuracy.
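
The QA-based screening described above can be sketched as a simple bit test. This is a minimal illustration rather than the authors' processing code: the bit layout (bits 0–1 for cloud state, bit 2 for cloud shadow) is an assumption about the MOD09GA state QA band that should be checked against the product user guide, and the function names are hypothetical.

```python
def is_clear_pixel(state_qa: int) -> bool:
    """Return True if a state QA value flags neither cloud nor cloud shadow.

    Assumed bit layout (verify against the MOD09GA user guide):
      bits 0-1: cloud state (0 = clear, 1 = cloudy, 2 = mixed, 3 = not set)
      bit 2:    cloud shadow (1 = shadow present)
    The "not set" state is treated conservatively as not clear.
    """
    cloud_state = state_qa & 0b11          # lowest two bits
    cloud_shadow = (state_qa >> 2) & 0b1   # third bit
    return cloud_state == 0 and cloud_shadow == 0


def mask_reflectance(reflectance, state_qa):
    """Replace reflectance values of flagged pixels with None (a gap to be filled later)."""
    return [r if is_clear_pixel(q) else None for r, q in zip(reflectance, state_qa)]
```

Flagged pixels are kept as gaps instead of being dropped, so that a later temporal filter can see where observations are missing.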

2.2.2. Auxiliary Data

Auxiliary data included the DEM and soil texture, both relevant to SM retrieval. The DEM was obtained from NASA’s global DEM product, featuring a spatial resolution of 30 m. As a successor to the Shuttle Radar Topography Mission (SRTM) DEM dataset, this product was developed by reprocessing SRTM data and integrating multiple DEM datasets to reduce data voids and enhance accuracy [48]. Soil texture data were sourced from the China National Soil Information Network, with a spatial resolution of 250 m. These data were generated using ensemble learning methods and encompass six soil depth layers (0–5 cm, 5–15 cm, 15–30 cm, 30–60 cm, 60–100 cm, and 100–200 cm) [49,50]. Given the absence of soil property data for the 0–40 cm depth range, this study employed soil texture data for the 0–30 cm depth as a substitute, calculated as a weighted average of the 0–5 cm, 5–15 cm, and 15–30 cm layers.
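
As an illustration of the layer aggregation, the sketch below computes a 0–30 cm value from the three shallowest layers. The text only states that a weighted average was used; weighting by layer thickness (5, 10, and 15 cm) is an assumption made here, and the function name and dictionary keys are hypothetical.

```python
# Layer thicknesses in cm for the three layers spanning 0-30 cm.
LAYER_THICKNESS_CM = {"0-5": 5.0, "5-15": 10.0, "15-30": 15.0}


def weighted_texture_0_30(layer_values):
    """Thickness-weighted mean of a soil property (e.g. sand %) over 0-30 cm.

    layer_values maps the layer labels above to the property value in that layer.
    """
    total = sum(LAYER_THICKNESS_CM.values())
    weighted = sum(layer_values[k] * LAYER_THICKNESS_CM[k] for k in LAYER_THICKNESS_CM)
    return weighted / total


# Example with made-up sand percentages: (30*5 + 36*10 + 42*15) / 30 = 38.0
sand_0_30 = weighted_texture_0_30({"0-5": 30.0, "5-15": 36.0, "15-30": 42.0})
```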

2.2.3. Ground Measurement Data

Soil sampling for the development of the SM estimation model was conducted between 21 March and 1 June 2015. Samples were taken from two depth intervals: 0–20 cm and 20–40 cm. Sampling was carried out approximately every 10 days, yielding eight data collection rounds on 21 March, 3 April, 11 April, 21 April, 1 May, 11 May, 21 May, and 1 June 2015. A total of 384 soil samples (24 locations × 8 rounds × 2 depths) were collected from 24 sampling locations, as illustrated in Figure 1. Samples were obtained using a soil drill, sealed in labeled aluminum containers, weighed, and transported to the laboratory for processing. The samples were oven-dried at 105 °C to 110 °C for 12 h, cooled to room temperature, and reweighed. The gravimetric SM content (θ_{m}) was calculated using Equation (1). To measure soil bulk density ρ_{b} (g/cm³), samples were collected from different depths using a 100 cm³ ring sampler. The samples were dried and weighed in the laboratory, and ρ_{b} was determined using Equation (2). Volumetric SM content θ_{v} (cm³/cm³) was then calculated by combining θ_{m} and ρ_{b}, as described in Equation (3). During the period from March to May, the main water-absorbing layer of wheat is concentrated at a depth of 0–40 cm. Therefore, for model construction, SM at the 0–40 cm depth was used, obtained by averaging the measured values at the 0–20 cm and 20–40 cm depths.

θ_{m} = \frac{W_{ω}}{W_{d}}

(1)

ρ_{b} = \frac{W_{d}}{V}

(2)

θ_{v} = θ_{m} \times ρ_{b}

(3)

where W_{ω} represents the moisture mass in the soil sample (g), defined as the difference between the wet and dry soil masses, W_{d} is the dry soil mass (g), and V is the volume of the ring sampler, which is 100 cm³.
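
Equations (1)–(3) and the depth averaging can be worked through in a few lines. The sketch below is illustrative only: the function names and the example masses are hypothetical, and the conversion in Equation (3) implicitly assumes a water density of 1 g/cm³.

```python
def gravimetric_sm(wet_mass_g, dry_mass_g):
    """Equation (1): water mass (wet minus dry) divided by dry soil mass (g/g)."""
    return (wet_mass_g - dry_mass_g) / dry_mass_g


def bulk_density(dry_mass_g, ring_volume_cm3=100.0):
    """Equation (2): dry soil mass divided by ring sampler volume (g/cm3)."""
    return dry_mass_g / ring_volume_cm3


def volumetric_sm(theta_m, rho_b):
    """Equation (3): gravimetric content times bulk density (cm3/cm3).

    Assumes water density of 1 g/cm3, so the units reduce to cm3/cm3.
    """
    return theta_m * rho_b


def profile_mean_sm(theta_v_0_20, theta_v_20_40):
    """0-40 cm value as the mean of the 0-20 cm and 20-40 cm measurements."""
    return 0.5 * (theta_v_0_20 + theta_v_20_40)


# Hypothetical sample: 120 g wet, 100 g dry; ring sample with 140 g of dry soil.
theta_m = gravimetric_sm(120.0, 100.0)   # 0.20 g/g
rho_b = bulk_density(140.0)              # 1.40 g/cm3
theta_v = volumetric_sm(theta_m, rho_b)  # 0.28 cm3/cm3
```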

2.2.4. Data Preprocessing

Data preprocessing encompassed projection, resampling, cropping, and filtering. All datasets were projected to the WGS84 UTM Zone 49 coordinate system, with TRIMS LST, DEM, and soil texture data resampled to a spatial resolution of 500 m. To address the impact of clouds and atmospheric interference on MOD09GA data, the Savitzky–Golay (SG) filtering algorithm was employed to interpolate missing values, producing seamless daily MODIS reflectance datasets [51]. These processed datasets provided the foundational inputs for subsequent SM estimation.
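
To make the gap-filling step concrete, the sketch below fills missing daily values by linear interpolation and then applies a five-point quadratic Savitzky–Golay smoother using its standard convolution coefficients (−3, 12, 17, 12, −3)/35. The window length, polynomial order, and the interpolate-then-smooth order are assumptions for illustration; the study's actual filter settings are not stated in the text, and the function names are hypothetical.

```python
def fill_gaps_linear(series):
    """Linearly interpolate None gaps; gaps at the edges take the nearest valid value."""
    valid = [i for i, v in enumerate(series) if v is not None]
    if not valid:
        raise ValueError("series has no valid observations")
    filled = list(series)
    for i in range(len(series)):
        if filled[i] is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


# Standard Savitzky-Golay weights for a 5-point window and a quadratic fit.
SG_COEFFS = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_5pt(series):
    """Smooth interior points with the 5-point quadratic filter; edges are left unchanged."""
    half = 2
    out = list(series)
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        out[i] = sum(c * v for c, v in zip(SG_COEFFS, window))
    return out


def fill_and_smooth(series):
    """Gap-fill a daily reflectance series, then smooth it."""
    return savgol_5pt(fill_gaps_linear(series))
```

A quadratic filter reproduces any locally quadratic signal exactly, which is why it tends to preserve seasonal peaks better than a plain moving average. In practice a library routine such as SciPy's `savgol_filter` would normally be used instead of hand-written coefficients.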

2.3. Methods

This study aims to develop a high-precision SM estimation model by integrating multi-source data with AutoML techniques. Multispectral (MS), thermal infrared (TIR), and auxiliary data were leveraged to derive multiple SM-related feature indices, including vegetation and temperature indices. Six input scenarios were designed to systematically analyze the impact of different feature categories on SM estimation. Three mainstream AutoML frameworks (TPOT, AutoGluon, and H2O AutoML) were employed to automate hyperparameter optimization and model selection, establishing robust mapping relationships between input variables and the observed SM. The machine learning models were trained and evaluated using 70% and 30% of the dataset, respectively, ensuring comprehensive model assessment. The complete research workflow is illustrated in Figure 2.
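
The six input scenarios can be written down as feature lists, which makes the comparison design explicit. The grouping follows the scenario definitions used in this study (SC1 to SC6); the column names and the helper function are hypothetical placeholders.

```python
# Feature groups; column names are illustrative placeholders.
MS = ["NDVI", "EVI", "SIWSI"]           # multispectral indices
TIR = ["LST", "TVDI", "VSWI"]           # thermal infrared indices
AUX = ["DEM", "sand", "silt", "clay"]   # topography and soil texture

SCENARIOS = {
    "SC1": MS,
    "SC2": TIR,
    "SC3": MS + TIR,
    "SC4": MS + AUX,
    "SC5": TIR + AUX,
    "SC6": MS + TIR + AUX,
}


def select_features(record, scenario):
    """Pick the input values for one scenario from a dict-like sample record."""
    return [record[name] for name in SCENARIOS[scenario]]
```

Keeping the scenarios in one mapping means each AutoML framework can be run over identical feature subsets, so differences in accuracy can be attributed to the inputs or the framework and not to inconsistent column selection.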

The MOD09GA reflectance data and TRIMS LST data used in this study were both processed to address missing data issues. The MOD09GA reflectance data were filled using the SG filtering method to correct for quality issues caused by weather and other factors, while the TRIMS LST data were reprocessed to fill in the missing portions. As these datasets have undergone filling, potential errors may exist, which could affect the accuracy of the subsequent SM retrieval results. To minimize data errors, invalid data were removed based on the quality control bands, and the nearest neighbor interpolation method was applied to fill a small amount of missing data prior to use. The model construction also relied on auxiliary data, such as DEM and soil texture data, but in some areas, these datasets may be missing or lack sufficient accuracy, which could limit the model’s generalization ability and accuracy. Despite the potential advantages of this method, practical application still requires careful consideration of data quality, model adaptability, and computational resources to ensure the model’s effectiveness and operability. This approach provides strong support for precision agriculture management, particularly in SM monitoring and irrigation management, with significant practical implications.

2.3.1. Feature Extraction

This study identified ten indices related to SM (Table 1). Multispectral data were used to calculate the NDVI, EVI, and SIWSI. The NDVI is calculated using the red and near-infrared bands and reflects spectral changes in the canopy caused by water stress. The EVI is an improved version of the NDVI that incorporates the blue band to reduce soil background interference, making it suitable for scenarios with high vegetation cover or peak growth stages. In contrast, the NDVI is more appropriate for moderate vegetation cover or growth and serves as a complementary measure to the EVI [52]. The SIWSI leverages the sensitivity of shortwave infrared and red bands to water stress to reflect SM dynamics, providing valuable insights for drought stress analysis. Additionally, three indices reflecting temperature characteristics were utilized: LST, TVDI, and VSWI. These indices effectively capture crop canopy temperature dynamics and growth conditions. When SM is sufficient, the evaporative cooling effect is strong, resulting in lower surface temperatures. In contrast, when moisture is lacking, evaporation decreases, and surface temperatures rise. Therefore, LST can indirectly reflect changes in SM [5]. The TVDI combines LST and vegetation cover to provide a more comprehensive indication of SM. Higher TVDI values typically indicate drought or moisture deficit, while lower values suggest sufficient moisture [26]. Similarly, higher values of the VSWI indicate adequate moisture, while lower values suggest moisture deficiency or stress [25]. Meanwhile, based on the potential influence of auxiliary data (topography and soil texture) on SM, six input scenarios were designed by combining the variability of the three categories of multispectral, thermal infrared, and auxiliary data in order to comprehensively analyze the contribution of the different categories of features to the accuracy of SM estimation (Table 2). 
The selection of these indices is driven by their ability to effectively capture vegetation health, water stress, and temperature dynamics, which are critical for the accurate estimation of SM. Moreover, the inclusion of auxiliary data, such as DEM and soil texture, is necessary. Topography influences SM distribution by affecting precipitation patterns and surface water flow, while soil texture plays a direct role in determining the soil’s water retention capacity and permeability.
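
For orientation, several of the indices can be computed as below. These are commonly used formulations and are assumptions here: the exact definitions in Table 1 may differ, SIWSI is omitted because its band configuration is not given in the text, and the dry- and wet-edge coefficients for the TVDI are hypothetical inputs that would normally be fitted from the NDVI–LST scatter of each scene.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue):
    """Enhanced vegetation index with the usual MODIS coefficients."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def vswi(ndvi_value, lst):
    """Vegetation supply water index, taken here as NDVI divided by LST."""
    return ndvi_value / lst


def tvdi(lst, ndvi_value, dry_edge, wet_edge):
    """Temperature vegetation dryness index.

    dry_edge and wet_edge are (intercept, slope) pairs giving the maximum and
    minimum LST as linear functions of NDVI.
    """
    a_dry, b_dry = dry_edge
    a_wet, b_wet = wet_edge
    lst_dry = a_dry + b_dry * ndvi_value
    lst_wet = a_wet + b_wet * ndvi_value
    return (lst - lst_wet) / (lst_dry - lst_wet)
```

With these forms, a pixel whose LST sits on the dry edge has a TVDI of 1 and a pixel on the wet edge has a TVDI of 0, matching the interpretation that higher TVDI values indicate a moisture deficit.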

2.3.2. Machine Learning Models

To improve the efficiency and accuracy of SM estimation, this study employed three AutoML frameworks: TPOT, AutoGluon, and H2O AutoML. These frameworks streamline the model development process by automating hyperparameter optimization, algorithm selection, and model integration, which significantly enhance predictive performance [38,39,41]. Each framework integrates multiple algorithms and utilizes ensemble learning techniques to further refine model performance. The primary aim of this study is to assess the effectiveness of these frameworks in integrating multi-category input features and enhancing SM estimation. Specifically, the relationship between input variables and the observed SM was modeled using the following equation:

SM = f(Input variable)

(4)

where f(·) denotes the machine learning model. To ensure fairness and comparability across the methods, a consistent training time of 5 min was set for all three AutoML frameworks.
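
One way to impose the same 5 min budget on all three frameworks is sketched below. It relies on each library's time-limit argument (`time_limit` in AutoGluon's `TabularPredictor.fit`, `max_time_mins` in `TPOTRegressor`, and `max_runtime_secs` in `H2OAutoML`). All other settings are left at their defaults, which is an assumption and not a description of the configuration used in the study; the label name `"SM"` and the wrapper functions are hypothetical.

```python
TRAIN_MINUTES = 5  # common training budget stated in the text


def time_budget_kwargs(minutes=TRAIN_MINUTES):
    """Map one time budget onto each framework's own time-limit argument."""
    return {
        "autogluon": {"time_limit": minutes * 60},    # seconds, passed to fit()
        "tpot": {"max_time_mins": minutes},           # minutes, passed to the constructor
        "h2o": {"max_runtime_secs": minutes * 60},    # seconds, passed to the constructor
    }


def fit_autogluon(train_df, label="SM"):
    """Train AutoGluon on a pandas DataFrame that contains the label column."""
    from autogluon.tabular import TabularPredictor  # imported lazily

    predictor = TabularPredictor(label=label, problem_type="regression")
    return predictor.fit(train_df, **time_budget_kwargs()["autogluon"])


def fit_tpot(X_train, y_train):
    """Train TPOT on feature and target arrays."""
    from tpot import TPOTRegressor  # imported lazily

    model = TPOTRegressor(**time_budget_kwargs()["tpot"])
    model.fit(X_train, y_train)
    return model


def fit_h2o(train_frame, label="SM"):
    """Train H2O AutoML on an H2OFrame; assumes h2o.init() has already been called."""
    from h2o.automl import H2OAutoML  # imported lazily

    aml = H2OAutoML(**time_budget_kwargs()["h2o"])
    aml.train(y=label, training_frame=train_frame)
    return aml
```

Wall-clock budgets are only comparable when the runs share the same hardware, so the three fits would ideally be executed on the same machine.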

TPOT model

TPOT optimizes machine learning pipelines using genetic programming [39]. TPOT includes a variety of base algorithms, such as decision tree, linear, ridge, lasso, elasticnet, RF, extremely randomized trees (XRTs), GBM, adaptive boosting (AdaBoost), voting, support vector regressor (SVR), and multi-layer perceptron (MLP). These algorithms span linear, tree-based, kernel-based, and neural network models, enabling TPOT to handle diverse data structures and relationships. TPOT leverages a genetic algorithm to explore and optimize pipelines by iteratively evaluating their performance using cross-validation. It incorporates ensemble methods such as bagging and boosting, which enhance the robustness and generalization ability of the models. For further details, please refer to https://epistasislab.github.io/tpot/ (accessed on 11 October 2023).

AutoGluon model

AutoGluon is designed for automated model selection and hyperparameter tuning with a focus on scalability and adaptability [38]. It integrates the following base algorithms: light gradient boosting machine (LightGBM), categorical boosting (CatBoost), extreme gradient boosting (XGBoost), RF, XRT, k-nearest neighbors (KNNs), linear, and neural networks. A key strength of AutoGluon is its multi-layered ensembling strategy, which builds ensembles by stacking models from diverse algorithm families. This approach ensures that the final prediction leverages the complementary strengths of various base models, resulting in improved performance across datasets with heterogeneous characteristics. For more information, visit https://auto.gluon.ai/ (accessed on 11 October 2023).

H2O AutoML model

H2O AutoML is an enterprise-grade framework known for its simplicity and computational efficiency [41]. It incorporates a set of core algorithms, including generalized linear model (GLM), GBM, distributed random forest (DRF), XRT, XGBoost and deep learning. H2O AutoML also emphasizes model ensembling, featuring stacked ensembles that combine the predictions of multiple base learners and optimize them using meta-learning techniques. A distinguishing feature of H2O AutoML is its scalability and ability to handle large-scale datasets efficiently, making it particularly suitable for high-resolution SM estimation tasks. Additional details are available at https://www.h2o.ai/products/h2o-automl/ (accessed on 11 October 2023).

TPOT, AutoGluon, and H2O AutoML each offer unique approaches to automating model selection and hyperparameter tuning. TPOT utilizes a genetic algorithm to optimize model pipelines, providing an exploratory method, although it incurs higher computational costs, making it more suitable for small- to medium-sized datasets. It employs a voting ensemble learning strategy to combine the predictions of multiple models. In contrast, AutoGluon leverages a multi-layer ensemble approach, aggregating the predictions of various base learners to improve accuracy, particularly in handling heterogeneous datasets, and employs Bayesian optimization for hyperparameter tuning. H2O AutoML emphasizes computational efficiency and scalability, which makes it well suited for large-scale problems. It applies random search for hyperparameter tuning and enhances prediction performance through meta-learning, which combines the outputs of several models.

2.3.3. Evaluation Strategy and Metrics

To comprehensively assess the performance of the proposed framework in SM inversion, multiple statistical indicators were used, including the Pearson correlation coefficient (R), root mean square error (RMSE), and relative root mean square error (RRMSE). These indicators have been widely applied in previous SM inversion studies [53,54,55]. Specifically, R quantifies the linear relationship between predicted and observed values, with values closer to 1 indicating stronger agreement between the model’s predictions and the actual observations. RMSE evaluates the average deviation between the predicted and observed values, with smaller values reflecting better model accuracy. RRMSE is the ratio of RMSE to the mean of the observed values, serving as a metric to assess the relative prediction error; smaller values signify lower relative error. The calculation methods for these statistical indicators are as follows:

R = \frac{\sum_{i = 1}^{N} (θ_{i}^{E} - θ_{Mean}^{E}) (θ_{i}^{O} - θ_{Mean}^{O})}{\sqrt{\sum_{i = 1}^{N} {(θ_{i}^{E} - θ_{Mean}^{E})}^{2}} \sqrt{\sum_{i = 1}^{N} {(θ_{i}^{O} - θ_{Mean}^{O})}^{2}}}

(5)

RMSE = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {(θ_{i}^{E} - θ_{i}^{O})}^{2}}

(6)

RRMSE = \frac{RMSE}{θ_{Mean}^{O}} \times 100

(7)

Here, N represents the number of measured or estimated SM sample points. θ^{O} denotes the measured SM, while θ_{i}^{E} and θ_{i}^{O} refer to the i-th estimated and measured SM, respectively. θ_{Mean}^{E} and θ_{Mean}^{O} represent the mean estimated and measured SM values, respectively.
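
The three metrics can be implemented directly from their definitions. The sketch below uses the standard Pearson form, in which the numerator sums the products of the paired deviations from the means; the function names and the example values are hypothetical.

```python
import math


def pearson_r(est, obs):
    """Pearson correlation between estimated and observed SM."""
    n = len(obs)
    mean_e = sum(est) / n
    mean_o = sum(obs) / n
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(est, obs))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / (math.sqrt(var_e) * math.sqrt(var_o))


def rmse(est, obs):
    """Root mean square error, in the units of SM (cm3/cm3)."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(est, obs)) / len(obs))


def rrmse(est, obs):
    """RMSE relative to the mean observed SM, in percent."""
    return rmse(est, obs) / (sum(obs) / len(obs)) * 100.0


# Made-up example: every estimate is 0.05 cm3/cm3 too high.
observed = [0.20, 0.30, 0.40]
estimated = [0.25, 0.35, 0.45]
```

In this example R equals 1 even though every estimate is biased, which illustrates why R is reported together with RMSE and RRMSE: correlation alone does not detect a systematic offset.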

3. Results

3.1. Descriptive Statistics

The dataset, consisting of 192 SM sample points collected between 21 March and 1 June 2015, was employed for model training and validation. It was divided into two subsets with a 7:3 ratio, allocating 70% (134 sample points) to the training set and 30% (58 sample points) to the testing set. Descriptive statistics for the full dataset, as well as the training and testing sets, are illustrated in Figure 3. In the full dataset, SM values range from 0.110 to 0.358 cm³/cm³, with a mean of 0.233 cm³/cm³ and a standard deviation of 0.063 cm³/cm³. Similarly, the training set exhibits SM values ranging from 0.110 to 0.358 cm³/cm³, with a mean of 0.244 cm³/cm³ and a standard deviation of 0.063 cm³/cm³. The testing set shows SM values ranging from 0.115 to 0.361 cm³/cm³, with a mean of 0.228 cm³/cm³ and a standard deviation of 0.063 cm³/cm³. The distribution of the training set closely mirrors that of the full dataset, ensuring that the model accurately captures the characteristics of the entire dataset, thereby enhancing the precision of regional SM predictions. Furthermore, the similarity in distribution between the training and testing sets reduces potential bias during model validation, providing a more accurate reflection of the model’s predictive performance.
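
The 7:3 partition can be reproduced in outline as follows. The text does not state how samples were assigned to each subset, so the random shuffle and the fixed seed are assumptions, and the function name is hypothetical.

```python
import random


def split_70_30(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples reproducibly and split them into training and testing sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # 192 samples -> 134 for training
    return shuffled[:n_train], shuffled[n_train:]


train_ids, test_ids = split_70_30(range(192))
```

With 192 sample points this yields 134 training and 58 testing samples, matching the subset sizes reported above.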

3.2. SM Estimation Accuracy Under Different Input Scenarios

As shown in Table 3 and Figure 4, SC6 (MS + TIR + auxiliary), the only scenario that combines all three data categories, achieved the highest accuracy. Across the three AutoML frameworks, it had an R range of 0.737–0.822 (mean: 0.785), RMSE range of 0.038–0.043 cm³/cm³ (mean: 0.040 cm³/cm³), and RRMSE range of 16.46–18.25% (mean: 17.11%). This high performance can be attributed to the complementary information in the MS, TIR, and auxiliary data, which together provide a more comprehensive representation of the factors influencing SM.

In contrast, the single-category input scenarios, SC1 (MS) and SC2 (TIR), were the least accurate. SC1 (MS) had an R range of 0.625–0.657 (mean: 0.641), RMSE range of 0.049–0.053 cm³/cm³ (mean: 0.051 cm³/cm³), and RRMSE range of 20.86–22.85% (mean: 21.83%). SC2 (TIR) performed worse still, with an R range of 0.546–0.563 (mean: 0.557), RMSE range of 0.053–0.055 cm³/cm³ (mean: 0.054 cm³/cm³), and RRMSE range of 22.83–23.67% (mean: 23.13%). SC1 (MS) is prone to spectral saturation, which limits its ability to capture the dynamic changes in deep SM, whereas SC2 (TIR) is more susceptible to interference from vegetation cover, which affects the accuracy of SM estimation.

In the two-category input scenarios, SC3 (MS + TIR) yielded an R range of 0.675–0.752 (mean: 0.701), RMSE range of 0.045–0.049 cm³/cm³ (mean: 0.047 cm³/cm³), and RRMSE range of 19.20–20.85% (mean: 20.01%). Replacing one of the two remote sensing categories with auxiliary data improved accuracy further. SC4 (MS + auxiliary) showed an R range of 0.706–0.781 (mean: 0.738), RMSE range of 0.043–0.045 cm³/cm³ (mean: 0.044 cm³/cm³), and RRMSE range of 18.36–19.36% (mean: 18.85%). SC5 (TIR + auxiliary) displayed an R range of 0.703–0.761 (mean: 0.741), RMSE range of 0.043–0.045 cm³/cm³ (mean: 0.044 cm³/cm³), and RRMSE range of 18.34–19.25% (mean: 18.70%).

SC6 (MS + TIR + auxiliary) outperforms SC3 (MS + TIR), which demonstrates the gain in SM estimation accuracy achieved by adding auxiliary data. Similarly, SC4 (MS + auxiliary) and SC5 (TIR + auxiliary) substantially improve estimation accuracy compared to SC1 (MS) and SC2 (TIR), further reinforcing the value of integrating auxiliary data. Specifically, the DEM captures the influence of topography on SM distribution and thus improves the model’s ability to represent spatial variability. Soil physical properties, particularly moisture retention and permeability, are also crucial for accurate SM estimation, as they directly govern the soil’s ability to retain water. The comparison between SC4 (MS + auxiliary) and SC5 (TIR + auxiliary) shows that, once auxiliary data are added, MS and TIR make similar contributions to SM inversion, with minimal differences in accuracy between the two.

Overall, SC6 (MS + TIR + auxiliary), which includes the most data categories, provided the best performance. This further confirms the importance of multi-source data fusion in enhancing estimation accuracy.
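The scenario means quoted in this section appear to be simple averages over the three AutoML frameworks. As a quick check under that assumption, the sketch below recomputes the SC3 and SC6 means from the per-framework values of Table 3 as they are quoted in Sections 3.3 and 4. The dictionary layout and the `scenario_mean` helper are illustrative names, not part of the study's code.

```python
# Per-framework metrics quoted from Table 3 (R, RMSE in cm³/cm³, RRMSE in %).
# Order within each tuple: TPOT, AutoGluon, H2O AutoML.
TABLE3 = {
    "SC3": {
        "R": (0.675, 0.752, 0.676),
        "RMSE": (0.049, 0.045, 0.047),
        "RRMSE": (20.85, 19.20, 19.97),
    },
    "SC6": {
        "R": (0.737, 0.822, 0.795),
        "RMSE": (0.043, 0.038, 0.039),
        "RRMSE": (18.25, 16.46, 16.63),
    },
}


def scenario_mean(scenario, metric, digits):
    """Average one metric over the three frameworks and round it."""
    values = TABLE3[scenario][metric]
    return round(sum(values) / len(values), digits)


for scenario in TABLE3:
    print(
        scenario,
        scenario_mean(scenario, "R", 3),
        scenario_mean(scenario, "RMSE", 3),
        scenario_mean(scenario, "RRMSE", 2),
    )
```

The recomputed values agree with the means reported in the text for both scenarios, which supports reading each "mean" as an unweighted average across frameworks.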

3.3. Comparison of TPOT, AutoGluon, and H2O AutoML

A comprehensive evaluation of the three AutoML algorithms across the six scenarios shows that AutoGluon outperforms the other two algorithms in most cases on multiple evaluation metrics (Table 3). In SC6 (MS + TIR + auxiliary), its R value reaches 0.822, higher than that of TPOT (R = 0.737) and H2O AutoML (R = 0.795), indicating a stronger ability to capture the relationship between the input variables and SM. For the error metrics, AutoGluon achieves an RMSE of 0.038 cm³/cm³ and an RRMSE of 16.46% in SC6 (MS + TIR + auxiliary), clearly lower than those of TPOT (RMSE = 0.043 cm³/cm³, RRMSE = 18.25%) and marginally lower than those of H2O AutoML (RMSE = 0.039 cm³/cm³, RRMSE = 16.63%). AutoGluon shows similar advantages in other scenarios. For example, in SC3 (MS + TIR), it achieves an R value of 0.752, RMSE of 0.045 cm³/cm³, and RRMSE of 19.20%, outperforming TPOT (R = 0.675, RMSE = 0.049 cm³/cm³, and RRMSE = 20.85%) and H2O AutoML (R = 0.676, RMSE = 0.047 cm³/cm³, and RRMSE = 19.97%). Although AutoGluon’s performance in SC1 (MS) is marginally lower than that of TPOT and H2O AutoML, its overall advantage across the other scenarios remains evident. Considering the average metrics across all scenarios (Figure 5), AutoGluon achieves a mean R of 0.717, RMSE of 0.046 cm³/cm³, and RRMSE of 19.71%, compared with TPOT (mean R = 0.671, mean RMSE = 0.047 cm³/cm³, mean RRMSE = 20.14%) and H2O AutoML (mean R = 0.693, mean RMSE = 0.047 cm³/cm³, mean RRMSE = 19.96%). These results further support AutoGluon’s robust performance in SM estimation.

Figure 6 presents scatter plots of the predictions for SC6 (MS + TIR + auxiliary) from all three algorithms. The predicted values for all algorithms are symmetrically distributed around the 1:1 reference line, effectively reflecting the observed variation trends in SM without significant systematic bias. Notably, AutoGluon’s scatter points align more closely with the 1:1 reference line, indicating its superior predictive accuracy and greater stability compared to TPOT and H2O AutoML.
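For readers who want to reproduce this kind of comparison, the following is a minimal sketch of the three reported metrics plus a mean-bias check for the 1:1 line. It assumes that RRMSE is the RMSE divided by the mean observed SM, expressed as a percentage; that normalisation is an assumption here and should be checked against the paper's own definition. The function names and the three sample values are hypothetical.

```python
import math


def pearson_r(obs, pred):
    """Pearson correlation coefficient between observed and predicted SM."""
    n = len(obs)
    mean_o = sum(obs) / n
    mean_p = sum(pred) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov / math.sqrt(var_o * var_p)


def rmse(obs, pred):
    """Root mean square error, in the units of the inputs (cm³/cm³ here)."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def rrmse(obs, pred):
    """RMSE as a percentage of the mean observed value (assumed definition)."""
    return 100 * rmse(obs, pred) / (sum(obs) / len(obs))


def mean_bias(obs, pred):
    """Mean of (predicted - observed); values near zero suggest no systematic bias."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


# Hypothetical observed and predicted SM values (cm³/cm³), for illustration only.
observed = [0.20, 0.25, 0.30]
predicted = [0.22, 0.24, 0.31]

print(pearson_r(observed, predicted))
print(rmse(observed, predicted))
print(rrmse(observed, predicted))
print(mean_bias(observed, predicted))
```

If this RRMSE definition matches the one used in the study, AutoGluon's reported RMSE of 0.038 cm³/cm³ and RRMSE of 16.46% would imply a mean observed SM of roughly 0.23 cm³/cm³ in the test set.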

3.4. Spatiotemporal Distribution of SM

The combination of SC6 (MS + TIR + auxiliary) with AutoGluon achieved the highest accuracy among all tested scenarios and algorithms, so it was used to generate the spatiotemporal distribution of SM in the People’s Victory Canal Irrigation Area (Figure 7). Spatially, SM differs markedly among land cover types: non-vegetated areas exhibit lower SM, whereas vegetated areas show higher SM. Temporally, SM on 3 April and 11 April 2015 shows a marked increase, closely corresponding to rainfall events. On 11 May 2015, a moderate SM increase driven by rainfall is also observed, but its magnitude is limited by the lower rainfall amount. By 1 June 2015, SM reaches its lowest level, coinciding with the wheat harvesting stage, when high temperatures intensify evaporation and lead to a pronounced decline in SM.

To further compare the three AutoML algorithms, two dates were selected for analysis (Figure 8): 21 March 2015 (a day without rainfall) and 3 April 2015 (a day with rainfall). On 21 March 2015, TPOT’s predicted values are lower than AutoGluon’s, while H2O AutoML slightly overestimates SM. On 3 April 2015, both TPOT and H2O AutoML produce predictions slightly higher than AutoGluon’s. These differences primarily arise from the different base algorithms and ensemble strategies employed by the three AutoML frameworks. Overall, however, the spatial distribution patterns generated by the three algorithms are consistent and show no significant discrepancies.

4. Discussion

This study systematically evaluated SM estimation accuracy across six input scenarios (MS, TIR, MS + TIR, MS + auxiliary, TIR + auxiliary, and MS + TIR + auxiliary) and comprehensively assessed the performance of three AutoML methods (TPOT, AutoGluon, and H2O AutoML). While prior research has extensively explored the integration of MS and TIR data [30,56], the contribution of auxiliary data, such as the DEM and soil texture, has been largely overlooked. The results in Table 3 show that in SC3 (MS + TIR), the R, RMSE, and RRMSE values were 0.675, 0.049 cm³/cm³, and 20.85% for TPOT; 0.752, 0.045 cm³/cm³, and 19.20% for AutoGluon; and 0.676, 0.047 cm³/cm³, and 19.97% for H2O AutoML. All three frameworks were less accurate in SC3 than in SC6 (MS + TIR + auxiliary), where TPOT achieved 0.737, 0.043 cm³/cm³, and 18.25%; AutoGluon achieved 0.822, 0.038 cm³/cm³, and 16.46%; and H2O AutoML achieved 0.795, 0.039 cm³/cm³, and 16.63%. A comparison of the two-category scenarios shows that SC4 (MS + auxiliary) and SC5 (TIR + auxiliary) outperformed SC3 (MS + TIR), with higher R values and lower RMSE and RRMSE values (Table 3). These results highlight the contribution of auxiliary data, such as the DEM and soil texture, to improving SM estimation accuracy. Incorporating the DEM and soil texture into multimodal data fusion frameworks is therefore essential.

Among the six input scenarios, SC6 (MS + TIR + auxiliary) yielded the highest estimation accuracy. As shown in Figure 4, it markedly outperformed the other scenarios, indicating that incorporating diverse data sources enhances SM estimation precision [57,58,59]. Multispectral and thermal infrared variables indirectly reflect SM conditions through spectral signatures [35,36], while the DEM and soil texture provide essential information on spatial distribution patterns and dynamic variability. The DEM captures key topographic features, such as elevation, slope, and relief, which govern surface runoff pathways and water accumulation zones and indirectly influence evaporation and infiltration processes [60,61]. Soil texture determines key soil hydrological properties, including water retention capacity, permeability, and drainage characteristics, which are fundamental to SM retention and transport dynamics [62]. Integrating these complementary factors allows the temporal and spatial heterogeneity of SM to be represented more accurately, which improves estimation accuracy. These findings highlight the contribution of auxiliary data to SM estimation and underscore the importance of comprehensive multimodal data integration for model performance and reliability.
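The size of the gain from auxiliary data can be made concrete per framework. The sketch below takes the SC3 and SC6 values of R and RMSE quoted from Table 3 and computes the increase in R and the relative RMSE reduction; the `auxiliary_gain` helper and the dictionary layout are illustrative names introduced here, not something the study reports.

```python
# (R, RMSE in cm³/cm³) per framework, quoted from Table 3.
SC3 = {"TPOT": (0.675, 0.049), "AutoGluon": (0.752, 0.045), "H2O AutoML": (0.676, 0.047)}
SC6 = {"TPOT": (0.737, 0.043), "AutoGluon": (0.822, 0.038), "H2O AutoML": (0.795, 0.039)}


def auxiliary_gain(framework):
    """Return (increase in R, RMSE reduction in %) when moving from SC3 to SC6."""
    r3, rmse3 = SC3[framework]
    r6, rmse6 = SC6[framework]
    delta_r = round(r6 - r3, 3)
    rmse_cut_pct = round(100 * (rmse3 - rmse6) / rmse3, 1)
    return delta_r, rmse_cut_pct


for name in SC3:
    delta_r, rmse_cut_pct = auxiliary_gain(name)
    print(f"{name}: R +{delta_r:.3f}, RMSE -{rmse_cut_pct:.1f}%")
```

On these figures, adding the DEM and soil texture raises R by about 0.06 to 0.12 and lowers RMSE by roughly 12% to 17%, with H2O AutoML showing the largest change and TPOT the smallest.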

Machine learning-based SM inversion methods have been widely used in previous studies [29,33,57]. However, traditional approaches often rely on the heuristic selection of one or a few algorithms for comparison and then adopt the best-performing model for prediction. This restricts the diversity of models explored and tends to overlook the complexity of hyperparameter tuning, ultimately reducing the robustness and generalizability of the model [37,44]. This study employed three AutoML frameworks (TPOT, AutoGluon, and H2O AutoML) that facilitate the rapid deployment of diverse models, including linear models, tree-based models, ensemble learning techniques, kernel-based methods, and neural networks. These frameworks leverage advanced integration strategies, extensive algorithm coverage, and automated hyperparameter optimization to support robust and accurate SM estimation [38,39,41].

All three algorithms demonstrated high estimation accuracy. Based on the average evaluation metrics across the six input scenarios, AutoGluon outperformed TPOT and H2O AutoML, achieving an overall performance of R = 0.717, RMSE = 0.046 cm³/cm³, and RRMSE = 19.71%, compared to TPOT (R = 0.671, RMSE = 0.047 cm³/cm³, and RRMSE = 20.14%) and H2O AutoML (R = 0.693, RMSE = 0.047 cm³/cm³, and RRMSE = 19.96%). Among all algorithm–input combinations, SC6 (MS + TIR + auxiliary) combined with AutoGluon achieved the highest SM estimation accuracy (R = 0.822, RMSE = 0.038 cm³/cm³, and RRMSE = 16.46%). In most scenarios, AutoGluon outperforms TPOT and H2O AutoML, largely due to its multi-layer stacking ensemble strategy, which enhances prediction robustness and generalization by progressively combining different algorithms [38]. In contrast, TPOT leverages genetic programming to optimize model pipelines and integrates multiple models through a voting mechanism, while H2O AutoML optimizes model ensembles by training meta-learners using stacking techniques. Overall, AutoGluon excels in handling complex tasks and heterogeneous data, offering greater model diversity and flexibility than both H2O AutoML and TPOT.

However, TPOT and H2O AutoML maintain distinct advantages in specific contexts. TPOT is particularly effective at optimizing model combinations for high-dimensional data and large feature spaces [39], whereas H2O AutoML performs strongly on large datasets by using distributed computing to improve scalability and efficiency [41]. Although AutoGluon generally performs well across a variety of scenarios, it still faces challenges when adapting to small sample sizes and complex data distributions. All AutoML frameworks are susceptible to overfitting with limited sample sizes, which highlights the importance of regularization and data augmentation. Complex data distributions can also compromise model stability, underscoring the role of data preprocessing and feature engineering in improving model performance.

In terms of spatiotemporal distribution, SC6 (MS + TIR + auxiliary) combined with AutoGluon provided comprehensive SM predictions, revealing significant differences between land types and marked spatial heterogeneity in the overall distribution. Temporal analysis demonstrated the model’s ability to track SM fluctuations driven by environmental changes. For example, the marked increase in SM during rainfall events was clearly captured (Figure 7). Notably, the predicted spatial distributions of SM were highly consistent across algorithms, with minimal differences, which underscores the reliability of TPOT, AutoGluon, and H2O AutoML in SM mapping (Figure 8). Despite differences in their underlying methodologies, all three algorithms effectively captured the spatiotemporal characteristics of SM, reflecting their comparable capabilities in handling spatial and temporal variations.
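The difference between a voting ensemble and a stacking ensemble can be illustrated with a toy, single-layer example. The sketch below is not the implementation used by TPOT, H2O AutoML, or AutoGluon (AutoGluon's multi-layer stacking in particular is far more elaborate); it only contrasts an equal-weight average of two base models with a linear meta-learner whose weights are fitted by least squares. The helper names and all numbers are hypothetical, and the base predictions are assumed to be held-out (out-of-fold) predictions.

```python
import math


def rmse(obs, pred):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((p - o) ** 2 for o, p in zip(obs, pred)) / len(obs))


def voting(pred_a, pred_b):
    """Voting for regression: equal-weight average of the base predictions."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def fit_stacking_weights(obs, pred_a, pred_b):
    """Linear meta-learner: weights minimising sum (obs - w_a*a - w_b*b)^2.

    Solves the 2x2 normal equations directly (no intercept term).
    """
    saa = sum(a * a for a in pred_a)
    sbb = sum(b * b for b in pred_b)
    sab = sum(a * b for a, b in zip(pred_a, pred_b))
    say = sum(a * y for a, y in zip(pred_a, obs))
    sby = sum(b * y for b, y in zip(pred_b, obs))
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("base predictions are collinear; weights are not unique")
    w_a = (say * sbb - sby * sab) / det
    w_b = (sby * saa - say * sab) / det
    return w_a, w_b


def stacking(pred_a, pred_b, weights):
    """Combine base predictions with the fitted meta-learner weights."""
    w_a, w_b = weights
    return [w_a * a + w_b * b for a, b in zip(pred_a, pred_b)]


# Hypothetical observed SM and held-out predictions of two base models (cm³/cm³).
observed = [0.18, 0.22, 0.25, 0.27, 0.30, 0.21]
base_a = [0.20, 0.21, 0.26, 0.25, 0.31, 0.22]
base_b = [0.15, 0.24, 0.22, 0.30, 0.27, 0.23]

weights = fit_stacking_weights(observed, base_a, base_b)
print("voting RMSE:  ", rmse(observed, voting(base_a, base_b)))
print("stacking RMSE:", rmse(observed, stacking(base_a, base_b, weights)))
```

On the data used to fit the weights, the stacked combination can never do worse than the equal-weight average, because equal weights are one of the candidates the least-squares fit considers. Whether that advantage holds on new data depends on sample size, which is consistent with the overfitting caveat raised above for small samples.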

To assess the accuracy of AutoML methods relative to traditional machine learning approaches, SM was also estimated under the SC6 (MS + TIR + auxiliary) variable combination using the widely used random forest (RF) and XGBoost models, with hyperparameters tuned via grid search. RF achieved R, RMSE, and RRMSE values of 0.731, 0.046 cm³/cm³, and 19.74%, respectively, while XGBoost achieved 0.729, 0.044 cm³/cm³, and 18.88%, respectively. Both models were less accurate than all three AutoML algorithms in the same scenario, with lower R and higher RMSE and RRMSE (Table 3). To check whether these differences were statistically significant, we conducted the Wilcoxon signed-rank test. All p-values were less than 0.05, indicating that the differences between the AutoML methods and the traditional machine learning models (RF and XGBoost) are statistically significant. This further supports the advantages of AutoML methods in SM estimation. The superior performance of AutoML methods can be attributed to their ensemble learning capabilities, as the optimal algorithms selected in this study were all ensemble-based models. This level of integration and optimization surpasses what can be achieved with single algorithms alone, highlighting the advantages of AutoML frameworks in leveraging complex model architectures to enhance SM estimation accuracy.
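The text does not state which paired quantities entered the Wilcoxon signed-rank test. As a minimal sketch, the code below assumes the test is applied to per-sample absolute errors of two models on the same test samples, and it uses the large-sample normal approximation with tie correction rather than exact p-values. The `wilcoxon_signed_rank` function is a hand-written illustration, not the routine used in the study, and the error values are synthetic.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, no continuity correction).

    Returns (W, p), where W is the smaller of the positive and negative rank sums.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean_w = n * (n + 1) / 4
    var_w = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w - mean_w) / math.sqrt(var_w)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return w, p


# Synthetic absolute errors (cm³/cm³) for 20 test samples, for illustration only.
errors_automl = [0.030 + 0.001 * i for i in range(20)]
errors_baseline = [e + 0.002 + 0.0005 * i for i, e in enumerate(errors_automl)]

w_stat, p_value = wilcoxon_signed_rank(errors_automl, errors_baseline)
print(w_stat, p_value, p_value < 0.05)
```

In this synthetic case every AutoML error is smaller than the baseline error, so the positive rank sum is zero and the p-value falls well below 0.05. For small samples an exact test from a statistics library would be preferable to the normal approximation.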

Despite the high efficiency and accuracy of the AutoML-based fusion approach for SM estimation, two key limitations remain in this study. First, the model’s performance still depends heavily on ground sampling data. Eight manual sampling campaigns were conducted in this study, a process that is both time-consuming and labor-intensive, and the spatial density of the samples may still be insufficient to fully capture the heterogeneity of the study area. Second, the current method may face challenges in cross-regional applications, as the transferability of AutoML models is highly contingent on the similarity between the target and training regions in terms of surface characteristics (e.g., soil type, vegetation index), climatic conditions, and data distribution. To address these limitations, future research could explore three improvement strategies. First, in terms of data augmentation, in addition to integrating existing observational data from regions with similar ecological characteristics [44,58], constructing standardized cross-regional datasets would improve the model’s transferability. Sharing datasets across regions could also mitigate local sample scarcity and reduce the model’s reliance on localized sampling data. Second, with regard to data fusion, multiple scenario simulations could be generated with the Hydrus model and linked to observations through a dynamic “observation–simulation” coupling mechanism. Applying data assimilation techniques to correct model parameters in real time [63] would both fill gaps in the time series and make the generated data more plausible by incorporating physical constraints. This approach would improve the consistency between simulated data and actual observations and enhance the overall predictive capability of the model.
Third, for algorithm optimization, a systematic evaluation of the performance differences among various AutoML frameworks under sparse sample conditions is needed, especially regarding their adaptability to spatial heterogeneity. A detailed analysis of these frameworks’ robustness in handling diverse environmental conditions (e.g., areas with complex terrain or significant variations in vegetation cover) would further improve the generalizability of the algorithms. The implementation of these improvement strategies would significantly enhance the model’s transferability.

5. Conclusions

This study proposed an integrated data fusion framework that combines multimodal data and AutoML techniques to estimate SM at depths of 0–40 cm in the People’s Victory Canal Irrigation Area. Through assessments of SM estimation accuracy across six input scenarios and comparisons of the performance of TPOT, AutoGluon, and H2O AutoML, the following key findings were derived:

(1): The integration of multispectral, thermal infrared, and auxiliary data achieved the highest SM estimation accuracy among the six scenarios, underscoring the pivotal role of multi-source data fusion in enhancing predictive performance;
(2): Among the three AutoML methods, AutoGluon outperformed TPOT and H2O AutoML, exhibiting superior predictive accuracy and model stability;
(3): The optimal SM estimation was achieved using a combination of multispectral, thermal infrared, and auxiliary data with the AutoGluon method, yielding an R of 0.822, RMSE of 0.038 cm³/cm³, and RRMSE of 16.46%.

These findings highlight the potential of combining multimodal data fusion with advanced AutoML techniques for achieving high-accuracy SM estimation. This approach offers valuable insights for enhancing agricultural management and water resource regulation in arid and semi-arid regions. Despite the promising results of AutoML applications, several challenges remain, particularly high computational costs and limited model interpretability. Future research could address these issues by exploring more efficient techniques or incorporating tools that enhance model interpretability. Additionally, integrating deep learning models in ensemble frameworks would facilitate the accurate capture of nonlinear relationships in the data, thus improving estimation accuracy. The utilization of higher-resolution satellite data, the development of standardized cross-regional datasets, and the incorporation of physical constraints would further augment the model’s practical applicability. These avenues merit further investigation in future studies.

Author Contributions

Methodology, S.L., P.Z. and N.S.; formal analysis, S.L. and N.S.; investigation, S.L. and P.Z.; resources, N.S. and C.L.; data curation, S.L. and C.L.; validation, S.L., N.S. and C.L.; writing—original draft, S.L. and N.S.; writing—review and editing, S.L., N.S. and J.W.; visualization, S.L.; supervision, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2023YFD1300801-04).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable and detailed comments, which were crucial in improving the quality of this paper. The authors would like to acknowledge the use of MOD09GA surface reflectance data provided by the NASA LP DAAC, located just outside of Sioux Falls, SD, USA, at the USGS Earth Resources Observation and Science (EROS) Center, TRIMS LST data supplied by the National Tibetan Plateau/Third Pole Environment Data Center, the NASA DEM obtained from the NASA Earthdata platform, and soil texture data provided by the Soil Sub Center, National Earth System Science Data Center, National Science and Technology Infrastructure. We sincerely thank these organizations for their data support, which was essential for the successful completion of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Seneviratne, S.I.; Corti, T.; Davin, E.L.; Hirschi, M.; Jaeger, E.B.; Lehner, I.; Orlowsky, B.; Teuling, A.J. Investigating soil moisture–climate interactions in a changing climate: A review. Earth Sci. Rev. 2010, 99, 3–161. [Google Scholar] [CrossRef]
Al-Yaari, A.; Wigneron, J.; Ducharne, A.; Kerr, Y.; de Rosnay, P.; de Jeu, R.; Govind, A.; Al-Bitar, A.; Albergel, C.; Muñoz-Sabater, J.; et al. Global-scale evaluation of two satellite-based passive microwave soil moisture datasets (SMOS and AMSR-E) with respect to land data assimilation system estimates. Remote Sens. Environ. 2014, 149, 181–195. [Google Scholar] [CrossRef]
Dorigo, W.; Wagner, W.; Albergel, C.; Albrecht, F.; Balsamo, G.; Brocca, L.; Chung, D.; Ertl, M.; Forkel, M.; Gruber, A.; et al. ESA CCI Soil Moisture for improved Earth system understanding: State-of-the-art and future directions. Remote Sens. Environ. 2017, 203, 185–215. [Google Scholar] [CrossRef]
Zhou, S.; Williams, A.P.; Lintner, B.R.; Berg, A.M.; Zhang, Y.; Keenan, T.F.; Cook, B.I.; Hagemann, S.; Seneviratne, S.I.; Gentine, P. Soil moisture–atmosphere feedbacks mitigate declining water availability in drylands. Nat. Clim. Chang. 2021, 11, 38–44. [Google Scholar] [CrossRef]
Li, Z.-L.; Leng, P.; Zhou, C.; Chen, K.-S.; Zhou, F.-C.; Shang, G.-F. Soil moisture retrieval from remote sensing measurements: Current knowledge and directions for the future. Earth-Sci. Rev. 2021, 218, 103673. [Google Scholar] [CrossRef]
Martínez-Fernández, J.; González-Zamora, A.; Sánchez, N.; Gumuzzio, A.; Herrero-Jiménez, C. Satellite soil moisture for agricultural drought monitoring: Assessment of the SMOS-derived Soil Water Deficit Index. Remote Sens. Environ. 2016, 177, 277–286. [Google Scholar] [CrossRef]
Gu, Z.; Zhu, T.; Jiao, X.; Xu, J.; Qi, Z. Neural network soil moisture model for irrigation scheduling. Comput. Electron. Agric. 2021, 180, 105801. [Google Scholar] [CrossRef]
Champagne, C.; White, J.; Berg, A.; Belair, S.; Carrera, M. Impact of soil moisture data characteristics on the sensitivity to crop yields under drought and excess moisture conditions. Remote Sens. 2019, 11, 372. [Google Scholar] [CrossRef]
Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Schepers, D.; et al. The ERA5 global reanalysis. Q. J. R. Meteorol. Soc. 2020, 146, 1999–2049. [Google Scholar] [CrossRef]
Robinson, D.A.; Campbell, C.S.; Hopmans, J.W.; Hornbuckle, B.K.; Jones, S.B.; Knight, R.; Ogden, F.; Selker, J.; Wendroth, O. Soil moisture measurement for ecological and hydrological watershed-scale observatories: A review. Vadose Zone J. 2008, 7, 358–389. [Google Scholar] [CrossRef]
Shi, C.; Xie, Z.; Qian, H.; Liang, M.; Yang, X. China land soil moisture EnKF data assimilation based on satellite remote sensing data. Sci. China 2011, 54, 1430–1440. [Google Scholar] [CrossRef]
Entekhabi, D.; Njoku, E.G.; O’Neill, P.E.; Kellogg, K.H.; Crow, W.T.; Edelstein, W.N.; Entin, J.K.; Goodman, S.D.; Jackson, T.J.; Johnson, J. The soil moisture active passive (SMAP) mission. Proc. IEEE 2010, 98, 704–716. [Google Scholar] [CrossRef]
Bartalis, Z.; Wagner, W.; Naeimi, V.; Hasenauer, S.; Scipal, K.; Bonekamp, H.; Figa, J.; Anderson, C. Initial soil moisture retrievals from the METOP-A advanced Scatterometer (ASCAT). Geophys. Res. Lett. 2007, 34, L20401. [Google Scholar] [CrossRef]
Kim, S.; Liu, Y.Y.; Johnson, F.M.; Parinussa, R.M.; Sharma, A. A global comparison of alternate AMSR2 soil moisture products: Why do they differ? Remote Sens. Environ. 2015, 161, 43–62. [Google Scholar] [CrossRef]
Kerr, Y.H.; Waldteufel, P.; Wigneron, J.; Martinuzzi, J.; Font, J.; Berger, M. Soil moisture retrieval from space: The soil moisture and ocean salinity (SMOS) mission. IEEE Trans. Geosci. Remote Sens. 2001, 39, 1729–1735. [Google Scholar] [CrossRef]
Peng, J.; Tanguy, M.; Robinson, E.; Pinnington, E.; Evans, J.G.; Cooper, E.; Hannaford, J.; Blyth, E.; Dadson, S. Estimation and evaluation of high-resolution soil moisture from merged model and Earth observation data in the Great Britain. Remote Sens. Environ. 2021, 264, 112610. [Google Scholar] [CrossRef]
Filgueiras, R.; Frappart, F.; Tavares, P.; Papa, F. Retrieval of high-resolution soil moisture through combination of Sentinel-1 and Sentinel-2 data. Remote Sens. 2020, 12, 2303. [Google Scholar] [CrossRef]
Babaeian, E.; Sidike, P.; Newcomb, M.S.; Maimaitijiang, M.; White, S.A.; Demieville, J.; Ward, R.W.; Sadeghi, M.; LeBauer, D.S.; Jones, S.B.; et al. A new optical remote sensing technique for high-resolution mapping of soil moisture. Front. Big Data 2019, 2, 37. [Google Scholar] [CrossRef] [PubMed]
Nie, Y.; Du, X.; Zhang, Z. Applications of remote sensing in precision agriculture: A review. Remote Sens. 2020, 12, 2450. [Google Scholar] [CrossRef]
Schnur, M.T.; Xie, H.; Wang, X. Estimating root zone soil moisture at distant sites using MODIS NDVI and EVI in a semi-arid region of southwestern USA. Ecol. Inform. 2010, 5, 400–409. [Google Scholar] [CrossRef]
Tucker, C.J. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sens. Environ. 1979, 8, 127–150. [Google Scholar] [CrossRef]
Han, Y.; Bai, X.; Shao, W.; Wang, J. Retrieval of soil moisture by integrating Sentinel-1A and MODIS data over agricultural fields. Water 2020, 12, 1726. [Google Scholar] [CrossRef]
Huete, A.R.; Liu, H.; Batchily, K. A comparison of vegetation indices over a global set of TM images for EOS-MODIS. Remote Sens. Environ. 1997, 59, 440–451. [Google Scholar] [CrossRef]
Fensholt, R.; Sandholt, I. Derivation of a shortwave infrared water stress index from MODIS near- and shortwave infrared data in a semiarid environment. Remote Sens. Environ. 2003, 87, 111–121. [Google Scholar] [CrossRef]
Hunt, E.R.; Yilmaz, M.T. Evaluation of Vegetation Supply Water Index (VSWI) derived from MODIS satellite data. J. Hydrol. 2007, 345, 119–130. [Google Scholar]
Sandholt, I.; Rasmussen, K.; Andersen, J. A simple interpretation of the surface temperature/vegetation index space for assessment of surface moisture status. Remote Sens. Environ. 2002, 79, 213–224. [Google Scholar] [CrossRef]
Yang, Y.; Wu, Z.; Zhang, J.; Wang, Q. A study on the complex and nonlinear relationship between soil moisture and remote sensing indices. J. Geophys. Res. Atmos. 2018, 123, 5123–5136. [Google Scholar]
Jia, X.; Zhao, L.; Chen, Z.; Wang, R. Machine learning applications in soil moisture estimation: Tackling nonlinear relationships using advanced algorithms. Remote Sens. 2020, 12, 686. [Google Scholar]
Özerdem, M.S.; Yıldız, Ö.; Kaya, Ş.; Kaplan, O. Utilizing machine learning techniques for the estimation of soil moisture based on remote sensing data. Comput. Electron. Agric. 2017, 142, 90–100. [Google Scholar]
Cheng, M.; Li, B.; Jiao, X.; Huang, X.; Fan, H.; Lin, R.; Liu, K. Using multimodal remote sensing data to estimate regional-scale soil moisture content: A case study of Beijing, China. Agric. Water Manag. 2022, 260, 107298. [Google Scholar] [CrossRef]
Zhang, Y.; Liang, S.; Zhu, Z.; Ma, H.; He, T. Soil moisture content retrieval from Landsat 8 data using ensemble learning. ISPRS J. Photogramm. Remote Sens. 2022, 185, 32–47. [Google Scholar] [CrossRef]
Cao, J.; Liu, F.; Zhao, J.; Yu, X.; Cheng, M.; Wang, Q. Application of stacking ensemble learning for enhancing soil moisture prediction using remote sensing data. Remote Sens. 2021, 13, 412. [Google Scholar]
Cui, X.; Liu, X.; Zhang, Z.; Gao, J.; Zhao, W. A stacking ensemble framework for robust agricultural yield prediction: Integrating multiple machine learning models. Comput. Electron. Agric. 2021, 180, 105856. [Google Scholar]
He, Y.; Zhao, Y.; Zhang, F.; Ma, W.; Yang, X. Ensemble learning-based methods for improving soil property estimation: A focus on stacking algorithms. Geoderma 2021, 385, 114874. [Google Scholar]
Tao, S.; Zhang, X.; Feng, R.; Qi, W.; Wang, Y.; Shrestha, B. Retrieving soil moisture from grape growing areas using multi-feature and stacking-based ensemble learning modeling. Comput. Electron. Agric. 2023, 204, 107537. [Google Scholar] [CrossRef]
Das, B.; Rathore, P.; Roy, D.; Chakraborty, D.; Jatav, R.S.; Sethi, D.; Kumar, P. Comparison of bagging, boosting and stacking algorithms for surface soil moisture mapping using optical-thermal-microwave remote sensing synergies. Catena 2022, 217, 106485. [Google Scholar] [CrossRef]
Sun, A.Y.; Scanlon, B.R.; Save, H.; Rateb, A. Reconstruction of GRACE total water storage through automated machine learning. Water Resour. Res. 2021, 57, e2020WR028666. [Google Scholar] [CrossRef]
Erickson, N.; Mueller, J.; Ulbricht, M.; Loveland, D.; Muehlbauer, M.; Daruwala, R.; Karnin, Z. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv 2020, arXiv:2003.06505. [Google Scholar]
Olson, R.S.; Bartley, N.; Urbanowicz, R.J.; Moore, J.H. Evaluation of a tree-based pipeline optimization tool for automating data science. In Proceedings of the GECCO’16: Proceedings of the Genetic and Evolutionary Computation Conference 2016, Denver, CO, USA, 20–24 July 2016; ACM: New York, NY, USA, 2016; pp. 485–492. [Google Scholar]
Jin, H.; Song, Q.; Hu, X. Auto-Keras: An efficient neural architecture search system. In Proceedings of the KDD ’19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; ACM: New York, NY, USA, 2019; pp. 1946–1956. [Google Scholar]
LeDell, E.; Poirier, S. H2O AutoML: Scalable automatic machine learning. arXiv 2020, arXiv:2006.11468. [Google Scholar]
Xu, R.Z.; Cao, J.S.; Ye, T.; Wang, S.N.; Luo, J.Y.; Ni, B.J.; Fang, F. Automated machine learning-based prediction of microplastics induced impacts on methane production in anaerobic digestion. Water Res. 2022, 223, 118975. [Google Scholar] [CrossRef]
Chen, H.; Wang, T.; Zhang, Y.; Bai, Y.; Chen, X. Dynamically weighted ensemble of geoscientific models via automated machine-learning-based classification. Geosci. Model Dev. 2023, 16, 5685–5701. [Google Scholar] [CrossRef]
Li, S.; Han, Y.; Li, C.; Wang, J. A novel framework for multi-layer soil moisture estimation with high spatio-temporal resolution based on data fusion and automated machine learning. Agric. Water Manag. 2024, 306, 109173. [Google Scholar] [CrossRef]
Vermote, E.F.; Kotchenova, S. Atmospheric correction for the monitoring of land surfaces. J. Geophys. Res. Atmos. 2008, 113, D23.
Tang, W.; Zhou, J.; Ma, J.; Wang, Z.; Ding, L.; Zhang, X.; Zhang, X. TRIMS LST: A daily 1 km all-weather land surface temperature dataset for China’s landmass and surrounding areas (2000–2022). Earth Syst. Sci. Data 2024, 16, 387–419.
Zhang, X.; Zhou, J.; Liang, S.; Wang, D. A practical reanalysis data and thermal infrared remote sensing data merging (RTM) method for reconstruction of a 1-km all-weather land surface temperature. Remote Sens. Environ. 2021, 260, 112437.
Uuemaa, E.; Ahi, S.; Montibeller, B.; Muru, M.; Kmoch, A. Vertical accuracy of freely available global digital elevation models (ASTER, AW3D30, MERIT, TanDEM-X, SRTM, and NASADEM). Remote Sens. 2020, 12, 3482.
Liu, F.; Zhang, G.L.; Song, X.; Li, D.; Zhao, Y.; Yang, J.; Yang, F. High-resolution and three-dimensional mapping of soil texture of China. Geoderma 2020, 361, 114061.
Liu, F.; Wu, H.; Zhao, Y.; Li, D.; Yang, J.L.; Song, X.; Zhang, G.L. Mapping high-resolution national soil information grids of China. Sci. Bull. 2022, 67, 328–340.
Chen, J.; Jönsson, P.; Tamura, M.; Gu, Z.; Matsushita, B.; Eklundh, L. A simple method for reconstructing a high-quality NDVI time-series data set based on the Savitzky–Golay filter. Remote Sens. Environ. 2004, 91, 332–344.
Zhou, Y.T.; Flynn, K.C.; Gowda, P.H.; Wagle, P.; Ma, S.F.; Kakani, V.G.; Steiner, J.L. The potential of active and passive remote sensing to detect frequent harvesting of alfalfa. Int. J. Appl. Earth Obs. Geoinf. 2021, 104, 13.
Long, D.; Bai, L.; Yan, L.; Zhang, C.; Yang, W.; Lei, H.; Quan, J.; Meng, X.; Shi, C. Generation of spatially complete and daily continuous surface soil moisture of high spatial resolution. Remote Sens. Environ. 2019, 233, 111364.
Wang, S.; Li, R.; Wu, Y.; Wang, W. Estimation of surface soil moisture by combining a structural equation model and an artificial neural network (SEM-ANN). Sci. Total Environ. 2023, 876, 162558.
Zhu, S.; Cui, N.; Zhou, J.; Xue, J.; Wang, Z.; Wu, Z.; Wang, M.; Deng, Q. Digital Mapping of Root-Zone Soil Moisture Using UAV-Based Multispectral Data in a Kiwifruit Orchard of Northwest China. Remote Sens. 2023, 15, 646.
Wu, Z.; Cui, N.; Zhang, W.; Yang, Y.; Gong, D.; Liu, Q.; Zhao, L. Estimation of soil moisture in drip-irrigated citrus orchards using multi-modal UAV remote sensing. Agric. Water Manag. 2024, 302, 108972.
Karthikeyan, L.; Mishra, A.K. Multi-layer high-resolution soil moisture estimation using machine learning over the United States. Remote Sens. Environ. 2021, 266, 112706.
Huang, S.; Zhang, X.; Wang, C.; Chen, N. Two-step fusion method for generating 1 km seamless multi-layer soil moisture with high accuracy in the Qinghai-Tibet plateau. ISPRS J. Photogramm. Remote Sens. 2023, 197, 346–363.
Abowarda, A.S.; Bai, L.; Zhang, C.; Long, D.; Li, X.; Huang, Q.; Sun, Z. Generating surface soil moisture at 30 m spatial resolution using both data fusion and machine learning toward better water resources management at the field scale. Remote Sens. Environ. 2021, 255, 112301.
Habtezion, N.; Tahmasebi Nasab, M.; Chu, X. How does DEM resolution affect microtopographic characteristics, hydrologic connectivity, and modelling of hydrologic processes? Hydrol. Process. 2016, 30, 4870–4892.
Sela, S.; Svoray, T.; Assouline, S. Soil water content variability at the hillslope scale: Impact of surface sealing. Water Resour. Res. 2012, 48, W03522.
Huang, X.; Shi, Z.H.; Zhu, H.D.; Zhang, H.Y.; Ai, L.; Yin, W. Soil moisture dynamics within soil profiles and associated environmental controls. Catena 2016, 136, 189–196.
Yinglan, A.; Wang, G.; Hu, P.; Lai, X.; Xue, B.; Fang, Q. Root-zone soil moisture estimation based on remote sensing data and deep learning. Environ. Res. 2022, 212, 113278.

Figure 1. Study area and distribution of sampling sites. Triangular markers indicate the locations of soil moisture (SM) sampling points. (a) Location of Henan Province, China; (b) Digital Elevation Model (DEM) of Henan Province and the location of the study area in Henan Province; (c) DEM of the study area; (d) Land cover classification map of the study area and the locations of sampling points.

Figure 2. Flowchart showing overall methodology for soil moisture (SM) estimation.

Figure 3. Statistical distribution of the full dataset, training set, and testing set.

Figure 4. Statistical indicators of soil moisture estimation accuracy under six input scenarios, including R, RMSE, and RRMSE.

Figure 5. Box plot illustrating the error distribution of the three AutoML algorithms under different scenarios.

Figure 6. Scatter plot of the prediction results from three AutoML algorithms using SC6 (MS + TIR + auxiliary) as the input variables.

Figure 7. Spatial and temporal distribution maps of soil moisture (SM).

Figure 8. Distribution maps of soil moisture (SM) estimation using AutoGluon, TPOT, and H2O AutoML for 21 March 2015 and 3 April 2015. The first column represents 21 March 2015, and the second column represents 3 April 2015.

Table 1. Input variables for soil moisture (SM) estimation.

Category	Input Variable	Formula	Reference
Multispectral	Normalized difference vegetation index (NDVI)	$NDVI = \frac{NIR - R}{NIR + R}$	[21]
	Enhanced vegetation index (EVI)	$EVI = \frac{2.5 \times (NIR - R)}{NIR + 6 \times R - 7.5 \times B + 1}$	[23]
	Shortwave infrared water stress index (SIWSI)	$SIWSI = \frac{SWIR - NIR}{SWIR + NIR}$	[24]
Thermal	Land surface temperature (LST)		[46]
	Vegetation supply water index (VSWI)	$VSWI = \frac{NDVI}{LST}$	[25]
	Temperature vegetation dryness index (TVDI)	$TVDI = \frac{LST - {LST}_{\min}}{{LST}_{\max} - {LST}_{\min}}$	[26]
Auxiliary	Digital elevation model (DEM)		[48]
	Sand		[49]
	Silt		[49]
	Clay		[49]

Note: B represents the blue band, R the red band, NIR the near-infrared band, and SWIR the shortwave infrared band, corresponding to bands 3, 1, 2, and 6 of MOD09GA, respectively.
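
The formulas in Table 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' processing chain: the function name `table1_indices` is hypothetical, the inputs are assumed to be surface reflectance values (B, R, NIR, SWIR from MOD09GA bands 3, 1, 2, and 6) plus an LST array, and `lst_min` and `lst_max` are simply passed in by the caller because Table 1 does not state how they are derived or which LST unit is used.

```python
import numpy as np

def table1_indices(blue, red, nir, swir, lst, lst_min, lst_max):
    """Evaluate the Table 1 index formulas element-wise.

    Assumptions (not from the article): inputs are reflectance and LST
    values or arrays of matching shape; lst_min and lst_max are supplied
    by the caller.
    """
    blue, red, nir, swir, lst = (
        np.asarray(a, dtype=float) for a in (blue, red, nir, swir, lst)
    )
    ndvi = (nir - red) / (nir + red)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)
    siwsi = (swir - nir) / (swir + nir)
    vswi = ndvi / lst
    tvdi = (lst - lst_min) / (lst_max - lst_min)
    return {"NDVI": ndvi, "EVI": evi, "SIWSI": siwsi, "VSWI": vswi, "TVDI": tvdi}

# Illustrative single-pixel values (made up for the example).
sample = table1_indices(0.05, 0.10, 0.40, 0.20, 300.0, 290.0, 310.0)
```

Zero denominators (for example `nir + red == 0` over masked pixels) are not handled here; a real workflow would mask such pixels first.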

Table 2. Combinations of six input scenarios and three automated machine learning (AutoML) methods.

Scenario	Input Parameter	AutoML Framework
SC1	Multispectral	TPOT, AutoGluon, and H2O AutoML
SC2	Thermal infrared	TPOT, AutoGluon, and H2O AutoML
SC3	Multispectral + Thermal infrared	TPOT, AutoGluon, and H2O AutoML
SC4	Multispectral + Auxiliary	TPOT, AutoGluon, and H2O AutoML
SC5	Thermal infrared + Auxiliary	TPOT, AutoGluon, and H2O AutoML
SC6	Multispectral + Thermal infrared + Auxiliary	TPOT, AutoGluon, and H2O AutoML
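
One way to express the six scenarios in Table 2 is as a mapping from scenario name to feature columns, using the variable groups listed in Table 1. The sketch below is an assumption about how the inputs could be organized; the column names and the helper `scenario_features` are hypothetical and do not come from the article.

```python
# Variable groups as listed in Table 1 (column names are illustrative).
GROUPS = {
    "MS": ["NDVI", "EVI", "SIWSI"],
    "TIR": ["LST", "VSWI", "TVDI"],
    "AUX": ["DEM", "Sand", "Silt", "Clay"],
}

# Scenario definitions as listed in Table 2.
SCENARIOS = {
    "SC1": ["MS"],
    "SC2": ["TIR"],
    "SC3": ["MS", "TIR"],
    "SC4": ["MS", "AUX"],
    "SC5": ["TIR", "AUX"],
    "SC6": ["MS", "TIR", "AUX"],
}

def scenario_features(name):
    """Return the flat list of input columns for one scenario."""
    return [col for group in SCENARIOS[name] for col in GROUPS[group]]
```

Each AutoML framework would then be trained once per scenario on the columns returned by `scenario_features`, giving the 18 scenario and framework combinations summarized in Table 3.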

Table 3. Summary of statistical indicators for the three algorithms under six input scenarios.

Scenario	TPOT			AutoGluon			H2O AutoML
Scenario	R	RMSE	RRMSE	R	RMSE	RRMSE	R	RMSE	RRMSE
SC1	0.640	0.049	20.86	0.625	0.053	22.85	0.657	0.051	21.77
SC2	0.546	0.053	22.83	0.563	0.053	22.89	0.563	0.055	23.67
SC3	0.675	0.049	20.85	0.752	0.045	19.20	0.676	0.047	19.97
SC4	0.726	0.044	18.82	0.781	0.043	18.36	0.706	0.045	19.36
SC5	0.703	0.045	19.25	0.760	0.043	18.52	0.761	0.043	18.34
SC6	0.737	0.043	18.25	0.822	0.038	16.46	0.795	0.039	16.63

Note: RMSE is in cm³/cm³; RRMSE is in %.
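
The three indicators in Table 3 could be computed as in the following sketch. It assumes that RRMSE is the RMSE divided by the mean observed SM, expressed as a percentage; this definition is common but is not stated in this section, so treat it as an assumption. The function name `evaluate` and the sample values are illustrative only.

```python
import numpy as np

def evaluate(observed, predicted):
    """Return (R, RMSE, RRMSE in %) for paired SM observations and predictions.

    Assumption: RRMSE = 100 * RMSE / mean(observed).
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    r = np.corrcoef(observed, predicted)[0, 1]  # Pearson correlation coefficient
    rmse = np.sqrt(np.mean((predicted - observed) ** 2))
    rrmse = 100.0 * rmse / np.mean(observed)
    return r, rmse, rrmse

# Made-up values in cm³/cm³, used only to exercise the function.
r, rmse, rrmse = evaluate([0.20, 0.25, 0.30], [0.22, 0.25, 0.28])
```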

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Li, S.; Zhu, P.; Song, N.; Li, C.; Wang, J. Regional Soil Moisture Estimation Leveraging Multi-Source Data Fusion and Automated Machine Learning. Remote Sens. 2025, 17, 837. https://doi.org/10.3390/rs17050837
