Article

Utilization of a Lightweight 3D U-Net Model for Reducing Execution Time of Numerical Weather Prediction Models

Department of Computer Engineering, Changwon National University, Changwon 51140, Republic of Korea
* Author to whom correspondence should be addressed.
Atmosphere 2025, 16(1), 60; https://doi.org/10.3390/atmos16010060
Submission received: 5 November 2024 / Revised: 2 January 2025 / Accepted: 6 January 2025 / Published: 8 January 2025

Abstract

Conventional weather forecasting relies on numerical weather prediction (NWP), which solves atmospheric equations using numerical methods. The Korea Meteorological Administration (KMA) adopted the Met Office Global Seasonal Forecasting System version 6 (GloSea6) NWP model from the UK and runs it on a supercomputer. However, due to high task demands, the limited resources of the supercomputer have caused job queue delays. To address this, the KMA developed a low-resolution version, Low GloSea6, for smaller-scale servers at universities and research institutions. Despite its ability to run on less powerful servers, Low GloSea6 still requires significant computational resources like those of high-performance computing (HPC) clusters. We integrated deep learning with Low GloSea6 to reduce execution time and improve meteorological research efficiency. Through profiling, we confirmed that deep learning models can be integrated without altering the original configuration of Low GloSea6 or complicating physical interpretation. The profiling identified “tri_sor.F90” as the main CPU time hotspot. By combining the biconjugate gradient stabilized (BiCGStab) method, used for solving the Helmholtz problem, with a deep learning model, we reduced unnecessary hotspot calls, shortening execution time. We also propose a convolutional block attention module-based Half-UNet (CH-UNet), a lightweight 3D-based U-Net architecture, for faster deep-learning computations. In experiments, CH-UNet showed 10.24% lower RMSE than Half-UNet, which has fewer FLOPs. Integrating CH-UNet into Low GloSea6 reduced execution time by up to 71 s per timestep, averaging a 2.6% reduction compared to the original Low GloSea6, and 6.8% compared to using Half-UNet. This demonstrates that CH-UNet, with balanced FLOPs and high predictive accuracy, offers more significant execution time reductions than models with fewer FLOPs.

1. Introduction

Weather forecasting is crucial to modern society, affecting industry, agriculture, and public safety. However, current numerical weather prediction (NWP) models require substantial computing resources and have complex structures, despite remarkable advances in hardware technology. In 2021, the Korea Meteorological Administration (KMA) adopted the Global Seasonal Forecasting System Version 6 (GloSea6), a global NWP model from the UK Met Office, for its climate prediction system. This model is currently operational on the fifth-generation supercomputer (Lenovo Group Ltd.) at the KMA.
The GloSea6 comprises four independent models operating synergistically through a coupler. The individual models include the atmospheric model Unified Model (UM version 8.6), the land surface model Joint UK Land Environment Simulator (JULES version 4.7), the ocean model Nucleus for European Modeling of the Ocean (NEMO version 3.4), and the sea ice model Los Alamos Sea-Ice Model (CICE version 4.1). The coupler facilitating the integration of these models is the Ocean Atmosphere Sea Ice Soil 3 Model Coupling Toolkit (OASIS3-MCT). Table 1 presents the detailed specifications and resolutions of the component models in GloSea6.
Therefore, GloSea6 is executed on supercomputers to accommodate its complex model structure and high-resolution computations. Due to the significant computational resources required, universities and research institutions that do not have access to supercomputers use the lower-resolution version of GloSea6, known as Low GloSea6. However, even when Low GloSea6 is run on high-performance computing servers or other specialized mid-sized servers, it still requires a considerable amount of time.
Meanwhile, with the recent advancements in artificial intelligence technology, research integrating traditional Numerical Weather Prediction (NWP) models with machine learning has been actively pursued in the field of meteorological science. Schultz et al. [1] established an effective framework for integrating NWP models with machine learning-based approaches. Furthermore, Kwok et al. [2], Chen et al. [3], Frnda [4], and Cho [5] proposed approaches that combine or entirely replace NWP models with deep learning-based models to accurately predict meteorological variables such as precipitation rates, cloud masks, and temperature.
Additionally, research has been conducted to replace the parameterization schemes of traditional NWP models with deep learning-based models, aiming to enhance accuracy or achieve faster execution times through the superior computational efficiency of deep learning models. Studies by Yao et al. [6], Mu et al. [7], and Zhong et al. [8] identified that solar radiation parameterization schemes in NWP models incur significantly higher computational costs compared to other physical processes. To address this, they replaced the solar radiation parameterization schemes with various deep learning models, achieving a minimum threefold improvement in computational efficiency. Furthermore, Chen et al. [9] and Zhong et al. [10] proposed approaches to replace cloud fraction and convection parameterization schemes with deep learning models, demonstrating improved accuracy while maintaining the original execution times. Moreover, Mu et al. [7], Zhong et al. [8], Mu et al. [11], and Wang et al. [12] proposed comprehensive frameworks for integrating Fortran-based NWP models with Python-based deep learning models, demonstrating that replacing specific submodules of NWP models with deep learning models is more advantageous in terms of computational efficiency.
In this study, we propose a framework to reduce the computation time of the Low GloSea6 NWP model by identifying time-consuming submodules and replacing them with deep learning models. To address the challenge of physical interpretability posed by the black-box nature of deep learning models, we integrated these models to accurately replicate the original outputs of the NWP model. Unlike previous studies that focus solely on reducing execution time and rely on conventional deep learning models, we propose a lightweight U-Net architecture with an encoder-decoder structure as an alternative. Finally, we present a comprehensive framework that ensures the trained deep learning model operates seamlessly within the Fortran 90 environment, enabling its integration into the Low GloSea6 system. The contributions of this research are as follows:
  • We analyzed the overhead and execution time-based hotspots of the Low GloSea6 weather model.
  • The structure of the identified hotspots was examined, and the parameters used were estimated to collect the optimal data for deep network training.
  • This study demonstrates the feasibility of applying deep networks by selectively replacing part of the existing numerical computation solutions of Low GloSea6 with deep network predictions, without increasing the challenges of physical interpretation.
  • We further optimized the execution time by making simple modifications to the lightweight 3D convolution-based architecture when integrating it with the NWP, making it even more lightweight. Additionally, we improved the accuracy of predictions by adding an attention block to the deep network. We demonstrate and validate the resulting accuracy improvements.
  • Through the proposed method, we successfully integrated a deep network model into the numerical computation solution of Low GloSea6, allowing it to run during execution in the Fortran 90 environment, and demonstrated its potential for seamless operation.

2. Materials and Methods

2.1. Low-Resolution of GloSea6

The GloSea6 model is a global climate model developed based on the HadGEM3 (Hadley Center Global Environmental Model Version 3) and functions as an ensemble prediction system executed in four major phases. These phases are the pre-processing phase, where input fields for the sub-models Atmosphere, Ocean, Land, and Sea-ice are prepared; the model execution phase, where actual predictions are made through numerical calculations; the ensemble model phase, which compensates for uncertainties; and the validation phase, which evaluates the accuracy of the model. Among these, the model execution phase, where actual predictions are made, is run in pairs: Atmosphere with Land and Ocean with Sea-ice. The resolution of the Atmosphere and Land models is 60 km (N216) at mid-latitudes, while the Ocean and Sea-ice models have a resolution of 25 km (eORCA025).
To perform such numerical calculations, the GloSea6 model, which requires high computational power, must utilize a supercomputer. Choi et al. [13] stated that due to the active meteorological research at the Korea Meteorological Administration (KMA), there is a high demand for computational resources to run NWP models. However, when many researchers run models simultaneously, the limited resources of the supercomputer can cause difficulties in model execution. Researchers with lower job priority may be assigned to lower queues, leading to long wait times. Operating GloSea6 on the hardware available at various research institutes and universities has also been challenging due to hardware limitations, resulting in significantly long execution times and inconvenience. To address this, a low-resolution model called Low GloSea6 was developed, enabling the execution of GloSea6 on small to medium-sized servers operated by university communities and research facilities. Low GloSea6 is structured similarly to GloSea6 (comprising four sub-models), but its grid size has been expanded from 60 km to a maximum of 170 km, lowering the resolution and allowing it to run on lower-spec servers compared to a supercomputer. Table 2 shows the resolutions of the GloSea6 model currently used on the KMA supercomputer and the low-resolution coupled model, Low GloSea6, used in this study.
The detailed Atmosphere model of GloSea6 operated by KMA adopts the Unified Model (UM) developed by the UK Met Office as shown by Walters et al. [14]. The science configuration of the UM is composed of Global Atmosphere 7.2 (GA7.2). The dynamical core of the UM, Even Newer Dynamics for General Atmospheric Modelling of the Environment (ENDGame), uses a semi-implicit semi-Lagrangian numerical scheme for advection calculations. The primary variables predicted in the atmospheric model are the three-dimensional wind components, virtual dry potential temperature, Exner pressure, and dry density. At the same time, moist prognostics are advected as free tracers. Using ENDGame, the UM features a dual iterative structure divided into an outer and inner loop for each time step.

2.1.1. Profiling Low GloSea6 for Target Selection in Deep Learning Applications

In this study, we analyzed the UM, the Atmosphere model of Low GloSea6. We used VTune Profiler [15], developed and provided by Intel, which offers tools for application performance, system performance, and optimization. Using Intel VTune Profiler, we analyzed the UM model’s executable file, “um-atmos.exe”, based on CPU usage, and considered the module that consumed the most CPU time as the hotspot.
To analyze the “um-atmos.exe” file, we performed some preliminary tasks. As mentioned in Section 1, Low GloSea6, like GloSea6, consists of multiple processes. To manage this, Low GloSea6 is run using ROSE [16] and CYLC [17] software implemented with Jinja2 [18]. ROSE manages tasks to be executed through the suite.rc file and supports a GUI for easy configuration. CYLC is a workflow engine that sequentially executes tasks declared in the suite.rc file configured by ROSE. These two programs are used to configure and manage Low GloSea6 efficiently. Figure 1 shows the execution process of Low GloSea6 as seen in the CYLC GUI. However, since VTune Profiler by default analyzes an executable that it launches directly, we added options to the CYLC suite file so that “um-atmos.exe” could be analyzed within the ROSE workflow. The “um-atmos.exe” file is executed under the “UM_MODEL” process shown in Figure 1. Therefore, we added an option to start the VTune analysis in the execution specification section of the relevant process in the rose-suite file.
We used a server with the hardware specifications listed in Table 3 for the analysis of Low GloSea6 using VTune Profiler. The analysis results showed that the Elapsed Time for one timestep was 427.722 s, and the total CPU Time was measured to be 29,877.390 s, as shown in Table 4. In Table 5, when analyzing the CPU Time for each module and function of Low GloSea6, the first, second, and fourth most time-consuming were MPI modules related to parallel processing, the third was an external module, and the fifth was the “_tri_sor_mod_MOD_tri_sor_dp_dp._omp_fn7” module of “um-atmos.exe”. Therefore, we selected the tri_sor.F90 module as the hotspot of the “um-atmos.exe” model, with the CPU Time of this function being 1125.987 s, accounting for 3.8% of the total CPU Time.

2.1.2. Hotspot Identified in Low GloSea6: 3D Successive Over-Relaxation

The “_tri_sor_mod_MOD_tri_sor_dp_dp._omp_fn7” module, detected as a hotspot during the CPU Time analysis of Low GloSea6 by module, exists in the atmosphere model, one of the four sub-models constituting Low GloSea6. This module is located in the inner loop of the dual iterative structure of the dynamical core, ENDGame. Specifically, the tri_sor.F90 module is used as a preconditioner for the Biconjugate Gradient Stabilized method (BiCGStab), which is a convergent solution method for solving the linear Helmholtz problem. The hotspot identified in the atmosphere model analysis resides within this tri_sor.F90 module.
Low GloSea6 offers three selectable preconditioner options to accelerate the convergence speed: the Diagonal preconditioner, the Jacobi preconditioner, and the Successive Over-Relaxation (SOR)-type preconditioner, the tri_sor.F90 module used in this study. Among them, the SOR-type preconditioner has demonstrated the best convergence speed and is therefore adopted as the default option. The tri_sor.F90 module is a three-dimensional SOR-based preconditioner that uses red-black ordering to solve linear systems. Previous studies by Tae [19], Mittal [20], and Allaviranloo [21] have utilized computationally intensive SOR-based iterative solvers to address large-scale linear and nonlinear systems with vast amounts of data, and the red-black ordering has shown high efficiency in accelerating convergence in systems designed to solve large-scale linear problems. Accordingly, we kept the SOR-type preconditioner as the default option and selected it as the primary hotspot for this study.
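To make the role of this preconditioner concrete, the sketch below shows one red-black SOR sweep for a generic 3D Poisson-type problem in NumPy. It is a simplified stand-in for what tri_sor.F90 does, not the UM implementation; the 7-point stencil, unit grid spacing, fixed boundary values, and relaxation factor are assumptions.
```python
import numpy as np

def red_black_sor_sweep(u, f, omega=1.5):
    """One red-black SOR sweep for a 3D Poisson-type problem -lap(u) = f
    on a unit-spaced grid with fixed (Dirichlet) boundary values."""
    nx, ny, nz = u.shape
    I, J, K = np.indices(u.shape)
    interior = (I > 0) & (I < nx - 1) & (J > 0) & (J < ny - 1) & (K > 0) & (K < nz - 1)
    for color in (0, 1):  # update all "red" points first, then all "black" points
        mask = interior & (((I + J + K) % 2) == color)
        neighbor_sum = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
                        np.roll(u, 1, 1) + np.roll(u, -1, 1) +
                        np.roll(u, 1, 2) + np.roll(u, -1, 2))
        gauss_seidel = (neighbor_sum + f) / 6.0  # Gauss-Seidel value at each cell
        u[mask] = (1.0 - omega) * u[mask] + omega * gauss_seidel[mask]
    return u

u = np.zeros((8, 8, 8))
f = np.ones((8, 8, 8))
for _ in range(50):  # repeated sweeps drive u toward the solution
    u = red_black_sor_sweep(u, f)
```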
Figure 2 illustrates the structure within “um-atmos.exe” where the tri_sor module is located. First, atm_step_4A manages the overall structure for each timestep. It begins by loading the initial fields to prepare for model execution and then proceeds with the dual-loop structure of the dynamical core, ENDGame. During this process, the outer loop performs two iterations, handling the semi-Lagrangian departure point equations. Following this, the Helmholtz problem is computed to calculate the pressure increment, which also involves a two-iteration structure, referred to as the inner loop. Each inner loop solves large-scale asymmetric linear systems using the BiCGStab method, which employs an iterative convergence approach. Among these, the tri_sor module, which is a three-dimensional SOR-based method, is used to accelerate the convergence of these iterations. After that, the calculated pressure increment is used to back-substitute and estimate the desired meteorological variables.
Due to this iterative structure, atm_step_4A is executed approximately 71 times per timestep, and within each execution of atm_step_4A, the BiCGStab method is performed 4 times. Additionally, the tri_sor module, identified as a hotspot, is continuously invoked during the convergence iteration process of BiCGStab. Depending on the difficulty of convergence, tri_sor is called a minimum of about 9 times to a maximum of about 260 times. Consequently, tri_sor is invoked approximately 2556 to 73,840 times per timestep, which significantly increases CPU time consumption due to the limitations of the iterative convergence approach. Therefore, our VTune Profiler analysis of Low GloSea6 confirmed that the numerical computations designed to solve this large-scale linear system inherently require tens of thousands of function calls per timestep because of the iterative convergence method.
Furthermore, this study proposes a framework to reduce execution time by integrating identified hotspots with a deep learning approach to accelerate the convergence of excessive iteration cycles. To achieve this, we explore the target areas for deep learning substitution in Section 2.1.3, while the data collection process for deep learning training is addressed in Section 2.2.

2.1.3. Biconjugate Gradient Stabilized Method

The BiCGStab method solver introduced by Vorst [22] is actively used in the computation of asymmetric linear systems, such as those found in computational fluid dynamics and image processing. Numerous previous studies by Vorst [22], Wang et al. [23], Long et al. [24], and Havdiak et al. [25] have employed this numerical computation method to solve linear systems, demonstrating high efficiency in High-Performance Computing (HPC) parallel processing systems. According to Havdiak et al. [25], the BiCGStab method is an improved version of the Biconjugate Gradient (BiCG) method and the Conjugate Gradient Squared (CGS) method, designed to enhance speed while mitigating the irregular convergence often encountered with the CGS method. One of the approaches used in the BiCGStab method involves utilizing the residual vector, where minimizing the magnitude of this residual vector leads to a smoother convergence process.
In Low GloSea6, the BiCGStab solver is employed as a numerical computation approach to accelerate the calculation process of pressure increments for the Helmholtz equation. Joly et al. [26] stated that solving asymmetric and large-scale three-dimensional linear systems, such as the Helmholtz equation, using traditional direct methods would require an astronomical amount of computational resources, making it practically impossible to solve the problem. As a result, iterative methods must be employed. As shown in Figure 2, Low GloSea6 performs computations for the vertical and horizontal wind components in the outer loop of the double-loop structure of ENDGame. During this process, the Helmholtz procedure calculates the pressure increments using the BiCGStab method, which is invoked a total of four times to accelerate the computation of the Helmholtz problem. Numerically, BiCGStab is a solver that resolves “Ax = b” iteratively until convergence to determine the pressure increments. A detailed explanation of the variables within BiCGStab for solving the Helmholtz problem in Low GloSea6 used in this study is provided in the discussion below.
The input variables for BiCGStab are “x” and “b”. The variable “x” represents the deviation of the Exner function value at the next time step relative to the reference Exner field, precomputed in the previous step. The use of the Exner function is motivated by the need for computational acceleration in BiCGStab, as it reduces the computational cost by converting the pressure field into a dimensionless variable and simplifying the governing equations through non-dimensionalization. The variable “b” corresponds to the right-hand side of the linear equation “Ax = b”. It is precomputed by combining wind, density, potential temperature, and pressure as inputs. Other variables are computed internally within BiCGStab to ensure convergence and are iteratively updated during each step. Among these, the residual variable “r” measures how well the current estimate “x” satisfies the right-hand side “b”. The variable “p” represents the optimal direction to reduce the residual “r” and is computed using a preconditioner in Low GloSea6. The preconditioner employed here is the “tri_sor” module, identified as a hotspot through prior profiling. Additionally, there are correction-related variables such as “v”, “s”, and “t”, which are used to update the residual during the iterative process. In this study, we aim to analyze the correlations between the input variables “x” and “b” and the internally computed variables within BiCGStab. These correlations will be utilized for deep learning training. The detailed methodology for this analysis is presented in Section 2.2.
The BiCGStab method is designed to ensure numerical convergence of the sparse matrix system and was chosen over methods like BiCG and CGS to reduce unnecessary memory allocation in parallel processing systems. The BiCGStab method performs two matrix-vector multiplications with the system matrix and requires four inner product operations per iteration. According to Wang et al. [27], compared to the CGS method, BiCGStab demands twice as many computations but offers similar or slightly faster convergence speeds. Low GloSea6 is a parallel processing-based NWP model and utilizes OpenMP to establish a shared-memory parallel framework. This allows it to avoid redundant copying of distributed memory and additional communication costs during the matrix-vector multiplication operations used in BiCGStab.
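To make the roles of the variables r, p, v, s, and t described above concrete, the following is a textbook preconditioned BiCGStab in NumPy, following van der Vorst. It is a schematic reference implementation rather than the Fortran routine in Low GloSea6; the identity default preconditioner and the small test matrix are placeholders.
```python
import numpy as np

def bicgstab(A, b, x0, precondition=lambda z: z, tol=1e-8, max_iter=200):
    """Preconditioned BiCGStab for A x = b. The variable names mirror the text:
    r (residual), p (search direction), v, s, t (correction vectors); r_hat is
    the fixed shadow residual. `precondition` plays the role of tri_sor."""
    x = x0.copy()
    r = b - A @ x
    r_hat = r.copy()
    rho_prev = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    for _ in range(max_iter):
        rho = r_hat @ r
        beta = (rho / rho_prev) * (alpha / omega)
        p = r + beta * (p - omega * v)
        p_hat = precondition(p)          # preconditioner step
        v = A @ p_hat                    # first matrix-vector product
        alpha = rho / (r_hat @ v)
        s = r - alpha * v
        s_hat = precondition(s)
        t = A @ s_hat                    # second matrix-vector product
        omega = (t @ s) / (t @ t)
        x = x + alpha * p_hat + omega * s_hat
        r = s - omega * t
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        rho_prev = rho
    return x

A = np.array([[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]])  # toy system
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(A @ bicgstab(A, b, np.zeros(3)), b, atol=1e-6))
```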
However, Long et al. [24] applied machine learning instead of the BiCGStab method to investigate the quantitative relationships of 3D-network heat conduction pathways and foam-like structures, which have broad applications in thermal management. Previously, numerical simulations combining Fast Fourier Transform (FFT) and the BiCGStab method were used, but with advancements in hardware technology, they built a large-scale database obtained through traditional methods and trained and evaluated several machine learning models. Using a Dell Precision 7920 workstation, it took 150 h to compute 704 high thermal conductivity structures using the traditional BiCGStab method. However, when using an Artificial Neural Network (ANN), the computation time was reduced to just a few minutes. The machine learning model enabled the efficient construction of new foam-like structures with accelerated performance.
During the execution of Low GloSea6, we analyzed the “UM_MODEL” process shown in Figure 1, specifically focusing on the “um-atmos.exe” program of the atmospheric model. By combining the profiling results, which identified the “tri_sor” module as a hotspot, with findings from several other studies, we verified the feasibility of replacing the BiCGStab method, which solves the linear system of the Helmholtz problem in the inner loop of the dynamical core (ENDGame) of Low GloSea6’s atmospheric model, with a deep learning-based solver. This confirmed that deep learning could be applied within the existing NWP operational process. Through this approach, as shown in Table 5, we aimed to substitute the frequent calls to the “tri_sor” module, which accounts for 3.8% of the total CPU time, with a deep learning method instead of the traditional numerical iterative convergence approach. We propose a framework that integrates the BiCGStab solver with a deep learning model to accelerate the computational process of the Helmholtz procedure within the dynamical core, ENDGame. The data collection and preprocessing of BiCGStab for deep learning training are discussed in Section 2.2.

2.2. Data

To build the dataset for machine learning training, we collected the parameters of the BiCGStab method during the execution of Low GloSea6 using the initial condition “20181117T00.” The “20181117T00” initial condition is a setting that reconstructs the weather data from November 17, 2018, at 00:00. Starting from this date, weather forecasts were generated at 1-day intervals for 10 timesteps, and the corresponding data were collected. Thus, we gathered data over a total of 10 days, but only the data from the first timestep were used for machine learning training. Additionally, Low GloSea6 supports parallel processing, and the BiCGStab method requires parallel computation. Therefore, we pre-configured the NWP model to use 16 cores during execution and collected data individually for each thread.
Furthermore, as described in Section 2.1.3, we analyzed the correlations among the variables used during the BiCGStab convergence process, beyond the input variables “x” and “b”, to identify an additional feature for the deep learning model. As a result, we selected one more input feature, bringing the total number of features used for model training to three. Furthermore, we confirmed that each variable updates iteratively during the BiCGStab convergence process as their values gradually approach convergence with each iteration. Based on this observation, we selected six features used during the initial 1–2 iterations as the input for the deep learning model. Here, the term “feature,” following Brownlee [28], refers to the variables used during model training in machine learning approaches. It may not directly correspond to meteorological variables [29] but rather represents the input or output variables of a machine learning or deep learning model, which could differ slightly in interpretation.
Figure 3 illustrates the correlation heatmap among the variables directly used in the BiCGStab process. The detailed procedure for the correlation analysis based on the heatmap is as follows. Among the variables from the first two iterations of BiCGStab, more than half showed near-zero correlations with the output variable. Additionally, variables such as “cr”, “r”, “v”, and “t” exhibited no changes during the first iteration because they are not used at this stage. Consequently, we selected the variables with correlations of 0.05 or higher, even though these correlations are still very weak, as input features for deep learning training. These variables are “x1”, “b1”, “p1”, “x2”, “b2”, and “p2”.
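A minimal sketch of this screening step is shown below, assuming each candidate variable has been flattened into a column of a table; the column names and the random data are placeholders for the actual BiCGStab fields.
```python
import numpy as np
import pandas as pd

# Placeholder table: one column per candidate variable from the first two
# BiCGStab iterations plus the converged output as the target.
rng = np.random.default_rng(0)
cols = ["x1", "b1", "p1", "r1", "v1", "x2", "b2", "p2", "target"]
df = pd.DataFrame(rng.standard_normal((161_500, len(cols))), columns=cols)

# Keep variables whose absolute Pearson correlation with the target is >= 0.05.
corr_with_target = df.corr()["target"].drop("target")
selected = corr_with_target[corr_with_target.abs() >= 0.05].index.tolist()
print(selected)  # with the real data this selection yields x1, b1, p1, x2, b2, p2
```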
The variables used in the analysis and for training the deep learning model differ from the initial input variables of BiCGStab. These variables are either updated iteratively in a converging direction or utilized in the correction and numerical computation processes during convergence. They are latent or temporary storage variables and, therefore, do not have direct associations with specific meteorological variables. By leveraging the high potential of artificial neural network structures under machine learning and deep learning approaches, we aimed to capture subtle changes during the initial convergence iterations of BiCGStab. This allows the deep learning model to predict results similar to the actual final convergence output without completing the entire iterative process. Specifically, instead of resolving the Exner perturbation “x” in the initial pressure field to satisfy the current physical constraint state “b” (precomputed using wind, density, and temperature fields), the deep learning model functions as a surrogate correction mechanism to perform the Exner perturbation adjustment directly. Another goal of this study is to integrate the deep learning model into the NWP framework without increasing the difficulty of physical interpretation. To achieve this, we focused on replacing computationally expensive hotspots, such as BiCGStab, which are not directly related to meteorological variables but consume significant execution time during NWP operations. To ensure the selected variables are distinguishable based on BiCGStab iterations, we redefined the variable names. Table 6 provides a detailed explanation of these redefined variables. Here, “x” represents the Exner value after BiCGStab convergence, which is the target predicted by the deep learning model.
As shown in Figure 4a, each feature has the same three-dimensional structure size as the initial condition, which is pre-configured to match the parallel processing settings. The total grid size used in Low GloSea6 is 200 (width) × 152 (height) × 85 (depth). We divided the entire grid equally in the vertical direction and assigned it to 16 cores for parallel processing. As a result, the grid allocated to each core’s process is a 3D grid with a size of 50 (width) × 38 (height) × 85 (depth), containing a total of 161,500 cells per process. However, for the convolution and max-pooling operations, each dimension must remain divisible by 2 at every downsampling level. Therefore, as shown in Figure 4b, we used trilinear interpolation to convert the size to 48 (width) × 64 (height) × 96 (depth), resulting in 294,912 cells in total.
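For reference, this resizing step can be expressed with PyTorch's trilinear interpolation as below; the (batch, channel, depth, height, width) axis ordering of the 5D tensor is an assumption made for illustration.
```python
import torch
import torch.nn.functional as F

field = torch.randn(1, 1, 85, 38, 50)  # one per-core subgrid of 50 x 38 x 85 cells
resized = F.interpolate(field, size=(96, 64, 48),
                        mode="trilinear", align_corners=False)
print(resized.shape)  # torch.Size([1, 1, 96, 64, 48]), i.e., 294,912 cells
```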

2.3. Deep Learning Approach Methodology

2.3.1. Lightweight 3D U-Net Architecture

Recent studies have achieved excellent performance by applying deep convolutional networks to the weather prediction process. Among these, research on weather variable prediction based on the U-Net model introduced by Ronneberger et al. [30] has been actively conducted. The U-Net model was originally proposed for image segmentation tasks in the biomedical field. It is an extended version of the Fully Convolutional Network (FCN), as proposed by Long et al. [31], characterized by a U-shaped architecture where the encoder and decoder processes are symmetric, with more feature channels present during the upsampling process. The model does not have fully connected layers, allowing it to utilize the valid feature information from each convolution. This enables the use of image information at various levels, leading to superior performance in segmentation tasks, as demonstrated by its outstanding results in the ISBI Cell Tracking Challenge 2015 on the 2D transmitted light dataset.
Figure 5 provides a detailed explanation of the U-Net architecture. This architecture is mainly composed of the contracting path on the left, which captures information, and the expansive path on the right, which restores the captured information. In the blocks of the contracting path, convolution operations are applied, followed by downsampling using max pooling, which then passes the data to the next level. At this point, the number of channels doubles, allowing more global information to be captured. In the blocks of the expansive path, the feature maps from the previous stage are upsampled, and then the number of channels is halved through convolution operations. Simultaneously, the upsampled feature maps are combined with the feature maps from the corresponding level in the contracting path. Finally, after applying convolution operations to the merged feature maps, the last layer uses a 1 × 1 convolution to map the output to the desired number of classes.
Although the U-Net model was originally proposed as an image segmentation architecture in the biomedical field, its ability to extract features at various levels and deliver excellent performance has led to its application in numerous studies and tasks. In the meteorological field, there have been several studies that aimed to predict weather variables by converting the U-Net architecture into a regressor. Kwok et al. [2] utilized a U-Net-based architecture to predict weather variables using the dataset from the geostationary weather satellite Meteosat, as part of the Weather4cast project. Inspired by the solution that placed fourth in the Traffic4cast 2020 competition, which shares similarities with Weather4cast, they implemented a Variational U-Net structure. To achieve this, they applied a Variational Autoencoder (VAE) style configuration at the bottleneck of the U-Net structure, specifically at the end of the encoder and the beginning of the decoder. At the end of the encoder, the data are reduced to a vector of size 512, representing the mean and standard deviation as interpreted by the VAE. This extracted vector is then reconstructed into an image under the assumption that the latent variable follows a Gaussian distribution and is passed through the decoder. Another study introduced by Kaparakis et al. [32] proposed the Weather Fusion UNet (WF-UNet), a modified model based on the U-Net architecture, for the task of precipitation nowcasting. Unlike the traditional U-Net model, the WF-UNet employs 3D convolutional layers. These 3D convolutions allow the model to extract not only spatial information from a single radar image but also temporal information from previous timesteps. Additionally, unlike other studies, this approach uses both precipitation data and wind speed radar images as input to train individual U-Net models. The features extracted from each model are then combined, and further convolutional operations are performed to derive the final prediction.
Although many studies in meteorology have explored U-Net architecture-based models, such as those by Kim [33] and Fernandez [34], additional computations or modules have often been added to enhance accuracy, increasing the complexity of the models. However, since our goal is to substitute certain modules of the existing NWP model with a deep learning-based solver, using a heavy convolutional network would double the execution time, defeating the purpose of the substitution. This study aims to verify the feasibility of substituting specific parts of the NWP internal process with deep learning models. We propose a lightweight 3D U-Net model that can substitute the BiCGStab method, a hotspot in Low GloSea6, while achieving computational results similar to the original, so that Low GloSea6 can still be run at the KMA and on small to medium-sized servers.
To achieve this, we introduce the CBAM-based Half-UNet (CH-UNet), a modification of the Half-UNet proposed by Lu et al. [35], which simplifies the decoder structure of the U-Net to reduce model complexity. As shown in Figure 6, similar to Half-UNet introduced by Lu et al. [35], the decoder process in the original U-Net is simplified by combining the full-scale computations into a single step. Additionally, whereas the original U-Net doubles the number of channels at each downscaling step of the encoder, Half-UNet keeps the channel count constant to simplify the network further. Furthermore, a Ghost module was introduced to generate the same feature maps at a lower cost. The proposed CH-UNet, shown in Figure 7, is inspired by the lightweight Half-UNet model. It has a similar overall architecture, with all convolution operations performed as 3D convolutions to match the characteristics of the data. In traditional U-Net or Half-UNet architectures, the initial channels of the encoding blocks are either 64 or 32. In the case of U-Net, if the initial channels are 64 and the model undergoes four levels of downscaling, the bottleneck layer would have 1024 channels. Although Half-UNet keeps the channel count constant, even 64 channels is too heavy for application within an NWP model. Therefore, as shown in Figure 7, we used only three levels and drastically reduced the number of channels, setting the initial number of channels to 8. Following the Half-UNet architecture, we kept the number of channels consistent across all levels. Additionally, we utilized the decoding process of Half-UNet by upsampling the feature maps from all levels to the original image size and combining them, thereby reducing computational costs. The Ghost module was not used.
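The sketch below illustrates this kind of lightweight configuration in PyTorch: three encoder levels with a constant channel count of 8, full-scale fusion of all levels at the input resolution, and a 1 × 1 × 1 regression head. It is a minimal reconstruction from the description above, not the exact CH-UNet; the six input channels, batch normalization, and the use of trilinear upsampling for the fusion (instead of the transposed convolutions of the best-performing variant) are assumptions, and the CBAM block discussed next is omitted here.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBlock3D(nn.Module):
    """Two 3x3x3 convolutions with a constant channel count."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
            nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.BatchNorm3d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)

class LightHalfUNet3D(nn.Module):
    """Three encoder levels with 8 channels each; every level is upsampled back
    to the input resolution and summed (full-scale fusion), then mapped to a
    single output channel by a 1x1x1 convolution for regression."""
    def __init__(self, in_ch=6, base_ch=8):
        super().__init__()
        self.enc1 = ConvBlock3D(in_ch, base_ch)
        self.enc2 = ConvBlock3D(base_ch, base_ch)
        self.enc3 = ConvBlock3D(base_ch, base_ch)
        self.pool = nn.MaxPool3d(2)
        self.head = nn.Conv3d(base_ch, 1, kernel_size=1)

    def forward(self, x):
        f1 = self.enc1(x)              # full resolution
        f2 = self.enc2(self.pool(f1))  # 1/2 resolution
        f3 = self.enc3(self.pool(f2))  # 1/4 resolution
        size = f1.shape[2:]
        fused = (f1
                 + F.interpolate(f2, size=size, mode="trilinear", align_corners=False)
                 + F.interpolate(f3, size=size, mode="trilinear", align_corners=False))
        return self.head(fused)

model = LightHalfUNet3D()
out = model(torch.randn(1, 6, 96, 64, 48))  # -> torch.Size([1, 1, 96, 64, 48])
```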
After this simplified decoding process, we added a Convolutional Block Attention Module (CBAM) introduced by Woo et al. [36]. As shown in Figure 8, CBAM is an attention module that can be applied to convolution operations, combining a Channel Attention Module and a Spatial Attention Module to capture which channels and positions to focus on, thereby improving accuracy with minimal cost. Therefore, at the end of the decoding process, we added CBAM and then applied a 1 × 1 convolution operation. The output was directly used for the regression task.
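A compact 3D adaptation of CBAM is sketched below. It follows the channel-then-spatial ordering of the original module by Woo et al.; the reduction ratio, kernel size, and this exact arrangement within CH-UNet are assumptions.
```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    def __init__(self, channels, reduction=2):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.mlp = nn.Sequential(nn.Linear(channels, hidden), nn.ReLU(inplace=True),
                                 nn.Linear(hidden, channels))

    def forward(self, x):
        b, c = x.shape[:2]
        avg = self.mlp(x.mean(dim=(2, 3, 4)))   # global average pooling branch
        mx = self.mlp(x.amax(dim=(2, 3, 4)))    # global max pooling branch
        return x * torch.sigmoid(avg + mx).view(b, c, 1, 1, 1)

class SpatialAttention3D(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv3d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)       # channel-wise average map
        mx, _ = x.max(dim=1, keepdim=True)      # channel-wise max map
        return x * torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM3D(nn.Module):
    """Channel attention followed by spatial attention, applied to a 3D feature map."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention3D(channels)
        self.sa = SpatialAttention3D()

    def forward(self, x):
        return self.sa(self.ca(x))

refined = CBAM3D(8)(torch.randn(1, 8, 96, 64, 48))  # same shape in, same shape out
```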

2.3.2. Deep Learning Utilization Method for NWP Models

The trained deep learning model is designed solely to substitute part of the BiCGStab computation. Therefore, to apply it to the NWP model, it is necessary to adapt the model to the execution environment of the NWP. We converted the model to run in the Fortran 90 environment for application to the UM model of Low GloSea6.
In this study, Python 3.8.19 and PyTorch 2.2.2 were used for training various deep learning models. To integrate the trained model with Low GloSea6, we used the FTorch library provided by Cambridge-ICCS [37]. This allowed us to load the trained deep learning model, written in Python, into Low GloSea6 within a Fortran environment. FTorch provides a library that enables models created and saved in Python-based PyTorch to be directly integrated into Fortran code using the Torch C++ interface, libtorch, without needing to call the Python executable.
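On the Python side, this workflow amounts to saving the trained network in TorchScript form so that the libtorch backend used by FTorch can load it from Fortran. A minimal sketch is shown below; the stand-in model, the example input shape, and the file name are placeholders rather than the actual export script.
```python
import torch

model = torch.nn.Conv3d(6, 1, kernel_size=1).eval()  # stand-in for the trained CH-UNet
example = torch.randn(1, 6, 96, 64, 48)              # assumed (batch, channel, depth, height, width)
scripted = torch.jit.trace(model, example)           # convert the model to TorchScript
scripted.save("ch_unet_ts.pt")                       # file later loaded from Fortran via FTorch
```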
We followed the FTorch documentation specifications precisely to convert and integrate the deep network model into Fortran. To combine it with Low GloSea6, we performed the following additional steps. First, as shown in Figure 1, Low GloSea6 integrates hundreds of Fortran code modules during the “KMA_LINUX” stage, which are then compiled using a make build process. This process results in the building of the um-atmos.exe file, which is executed in the “UM_MODEL” stage. Therefore, we needed to write code that utilizes the model converted by FTorch and include it in the “KMA_LINUX” stage. However, FTorch requires gfortran version 11, while Low GloSea6 uses version 9. To address this, we pre-built the FTorch library to obtain the necessary mod files. Additionally, the “torch_tensor_from_array” function provided by FTorch supports up to 4 dimensions, but we needed to perform 3D-based convolution operations, which require a 5-dimensional tensor (batch, channel, depth, height, width). Therefore, we modified the “torch_tensor_from_array” function to support 5-dimensional tensors before the build process. Next, to use the pre-built FTorch library’s mod files during the make build process in Low GloSea6, we specified the path to these mod files in the ROSE make config file using a flag option. This allowed us to successfully integrate the deep network model, originally written in Python-based PyTorch, into the Fortran-based NWP model, Low GloSea6.
As mentioned in Section 2.1.3, the BiCGStab method is called four times in the UM model of Low GloSea6 within the dual loop structure of ENDGame. We analyzed the execution time of each iteration, as shown in Table 7. During one timestep, BiCGStab was called a total of 4609 times, and we measured and averaged the CPU time required for BiCGStab computations in each loop. The most time-consuming combination was the first outer loop with the first inner loop, taking an average of 1.9284 s, while the second outer loop with the second inner loop required the least CPU time, at 0.1548 s. As shown in Table 7, the first outer and first inner loops took more than twice as long as the other loops. We concluded that applying the deep network in the other cases, rather than in the first outer and first inner loops, could lead to increased execution times due to the complexity of the network, potentially negating the intended benefits of replacing BiCGStab.
Therefore, as illustrated in Figure 9, we integrated the deep network model converted with the FTorch library into the first outer and first inner loops. After performing computations with the deep network model, we proceeded with BiCGStab computations again. This additional step was necessary because using only the deep network model did not fully adhere to the physical constraints of numerical computations, leading to errors in other modules. Therefore, we had to continue using the traditional numerically based convergence method. However, in line with the goals of this study, we successfully integrated the deep learning model into Low GloSea6, the NWP model currently used by KMA, and were able to apply it to the most CPU-intensive parts—namely, the first outer and first inner loops of BiCGStab—without significant performance degradation. Additionally, by conducting further BiCGStab computations, we ensured that the physical constraints of the NWP model were faithfully maintained.
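One way to read this hybrid step is that the network's output serves as the starting point for the subsequent solve, so the iterative method only has to correct the surrogate prediction. The SciPy-based sketch below illustrates this reading on a stand-in operator; the matrix, the stand-in network, and the interpretation of the network output as an initial guess are assumptions, not the actual coupling code.
```python
import numpy as np
import torch
from scipy.sparse import diags
from scipy.sparse.linalg import bicgstab

n = 48 * 64 * 96                                                       # cells in one resized subgrid
A = diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # toy sparse operator
b = np.random.rand(n)

surrogate = torch.nn.Conv3d(6, 1, kernel_size=1).eval()               # stand-in for CH-UNet
features = torch.randn(1, 6, 96, 64, 48)
with torch.no_grad():
    x0 = surrogate(features).numpy().ravel().astype(np.float64)       # network output as initial guess

x, info = bicgstab(A, b, x0=x0, maxiter=500)                          # iterative solve refines the guess
print("converged" if info == 0 else f"info = {info}")
```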

3. Experiments

3.1. Deep Learning Model Experiments

3.1.1. Experimental Environments

As specified in Section 2.2, we validate the CH-UNet model using the dataset collected from the existing BiCGStab method in Low GloSea6. The dataset was collected at 1-day intervals for 10 timesteps starting from 20181117T00. Additionally, in the grid architecture used by the BiCGStab method, the max pooling operations during encoding must be applied to layers with even sizes in every dimension to allow seamless tiling of the output feature maps. As shown in Figure 6, we therefore resized the model’s input and output data using trilinear interpolation to 48 (width) × 64 (height) × 96 (depth). To ensure efficient learning, we applied min-max scaling to normalize all features to the same distribution range for training purposes.
Each timestep contains 4608 sets of 3D grid data, of which the 1152 sets from the first outer loop and first inner loop were used. For the first day’s timestep, these 3D grid data were randomly divided at a 4:1 ratio into 922 sets for the training dataset and 230 sets for the validation dataset, while the timesteps from day 2 to day 10 were used for evaluation. The server specifications used for training are the same as those in Table 8. To ensure a fair comparison, all networks applied early stopping, halting training if there was no improvement in the validation loss for 30 epochs. The same Adaptive Moment Estimation (Adam) optimizer was used for training, with an initial learning rate of 0.0001. All networks completed training within 1250 epochs. During training, the mini-batch size was set to 8, and the loss function used for each model was the RMSE loss. Compared with their original configurations, the number of channels was drastically reduced for all architectures: for a fair comparison, we set the number of channels in the first encoding block of every model to 8 and configured each model to follow its respective rules during the downscaling process.
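The preprocessing and split described above can be sketched as follows; the per-channel form of the min-max scaling and the reduced placeholder grid size are assumptions made to keep the example lightweight.
```python
import torch
from torch.utils.data import TensorDataset, random_split

# 1152 first-loop grids; the small placeholder dimensions stand in for the
# real 48 x 64 x 96 grids.
features = torch.rand(1152, 6, 24, 16, 12)
targets = torch.rand(1152, 1, 24, 16, 12)

def min_max_scale(t, dims=(0, 2, 3, 4)):
    """Scale each channel to [0, 1] over all samples and grid cells."""
    lo = t.amin(dim=dims, keepdim=True)
    hi = t.amax(dim=dims, keepdim=True)
    return (t - lo) / (hi - lo + 1e-12)

dataset = TensorDataset(min_max_scale(features), min_max_scale(targets))
train_set, val_set = random_split(dataset, [922, 230],  # 4:1 split from the text
                                  generator=torch.Generator().manual_seed(0))
```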

3.1.2. Evaluation Metrics

In this paper, we selected the root mean squared error (RMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE), weighted absolute percentage error (WAPE), and R2 score as the evaluation metrics for the regression model trained on the Low GloSea6 BiCGStab method data. The BiCGStab method’s numerical computation was set as the ground truth for training, and the data used for both training and prediction were 3D grid data. The goal was to predict evenly across all grids, so the 3D data were flattened for evaluation. RMSE was chosen because it adds a square root to the mean squared error (MSE), addressing the drawback of MSE being sensitive to outliers. Since RMSE indicates the error compared to the ground truth, lower values represent higher accuracy. RMSE is calculated as shown in (1).
$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}$ (1)
MAE represents the average of all absolute errors, and like RMSE, it indicates the error compared to the ground truth, meaning that lower values indicate higher accuracy. However, MAE is more robust to outliers than RMSE, meaning that error values are relatively less affected by outliers. The calculation of MAE is shown in (2).
$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$ (2)
SMAPE was designed to address the limitations of the mean absolute percentage error (MAPE). Due to the characteristics of the data used in this study, which include zero or negative values, MAPE could not be used. Therefore, SMAPE was chosen, as it yields values between 0% and 200%, making the results easier to interpret. By representing performance as a ratio, it allows for straightforward evaluation of the quantities used in the BiCGStab method, which are not directly interpretable as temperature or pressure. A lower SMAPE value indicates higher accuracy. The calculation of SMAPE is shown in (3).
$\mathrm{SMAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{\left(\left|\hat{y}_i\right| + \left|y_i\right|\right)/2}$ (3)
WAPE was similarly considered to address the limitations of MAPE. WAPE gives less consideration to the absolute value of the units. Since the result values of the BiCGStab method show significant differences in magnitude, WAPE allows us to obtain a performance metric that considers appropriate weighting for each value. A lower WAPE value indicates higher accuracy, and the calculation is shown in (4).
$\mathrm{WAPE} = \frac{\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|}{\sum_{i=1}^{n}\left|y_i\right|}$ (4)
The R2 score is the coefficient of determination, which corresponds to the squared value of the correlation coefficient. Unlike the correlation coefficient, however, the R2 score quantifies how much of the variance in the target is explained by the regression model. The R2 score ranges from 0 to 1, and the closer it is to 1, the better the fit of the regression model. The formula for the R2 score is shown in (5).
$R^2\ \mathrm{score} = \frac{\sum_{i=1}^{n}\left(\hat{y}_i - \bar{y}\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$ (5)
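For reference, the five metrics can be written directly from Equations (1)–(5) as NumPy functions over flattened grids; treating the predictions and ground truth as 1D arrays is an assumption that matches the flattening described above, and the explained-variance form of the R2 score follows Equation (5) as written.
```python
import numpy as np

def rmse(y_hat, y):
    return np.sqrt(np.mean((y_hat - y) ** 2))                          # Equation (1)

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))                                  # Equation (2)

def smape(y_hat, y):
    return 100.0 * np.mean(np.abs(y_hat - y)
                           / ((np.abs(y_hat) + np.abs(y)) / 2.0))      # Equation (3), in percent

def wape(y_hat, y):
    return np.sum(np.abs(y_hat - y)) / np.sum(np.abs(y))               # Equation (4)

def r2_score(y_hat, y):
    return np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)  # Equation (5)

y_true = np.array([0.10, -0.20, 0.30, 0.05])
y_pred = np.array([0.08, -0.18, 0.33, 0.02])
print(rmse(y_pred, y_true), r2_score(y_pred, y_true))
```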

3.1.3. Experimental Results

We compared the proposed CH-UNet model, designed to substitute the BiCGStab method in the atmospheric model of Low GloSea6, with U-Net, U-Net3+, and Half-UNet. Parameters and floating point operations (FLOPs) were used as metrics to represent the computational cost of each model. The aforementioned RMSE, MAE, SMAPE, WAPE, and R2 score were used as metrics to compare the accuracy of these models against the ground truth, which was the BiCGStab method through numerical computation. Table 9 presents the quantitative comparison results.
The proposed CH-UNet architecture and the existing U-Net-based architectures were configured in two main ways. In the decoding process, which reconstructs the extracted information, we evaluated both transposed convolution and trilinear interpolation as upsampling methods for expanding the feature maps. When comparing the number of parameters and FLOPs, it was generally observed that the transposed convolution method resulted in higher values than the trilinear interpolation method. This is because trilinear interpolation does not involve learning for the interpolation process. When evaluating models based on transposed convolution, the FLOPs comparison across models showed that the proposed CH-UNet had a slight increase in computational cost compared to UNet and Half-UNet. However, since the number of channels was drastically reduced to 8, this increase in FLOPs is unlikely to result in significant time differences during actual computations, which will be detailed in Section 3.2.
When evaluating the accuracy of each architecture by comparing their results against the numerical BiCGStab method as the ground truth, the Half-UNet and the proposed CH-UNet models clearly demonstrated superior accuracy when using transposed convolution-based upsampling. With transposed convolution, the RMSE improved by 10% for Half-UNet and 40% for the proposed CH-UNet. U-Net and U-Net3+, in contrast, showed better accuracy when using trilinear interpolation. Overall, when comparing the models, the proposed CH-UNet with transposed convolution outperformed all others in every accuracy evaluation metric. Specifically, it showed 7.4% higher correlation in R2 score compared to the Half-UNet with trilinear interpolation, which had the lowest computational cost, and 3.3% higher correlation compared to the second-best U-Net model in terms of R2 score.
Since the BiCGStab method was performed once more after the deep network model predictions to meet the physical constraints of the NWP model, it is more efficient in terms of computational cost to adopt the proposed CH-UNet model, which offers the best balance of low computational cost and high accuracy, rather than selecting the architecture with the lowest computational cost. This will be demonstrated in Section 3.2.

3.2. Experiments on the Utilization of Deep Learning in NWP Models

3.2.1. Experimental Environments

We aim to verify the feasibility of integrating a deep network model into the existing NWP model by combining it with the Low GloSea6 model, which is used and distributed by the Korea Meteorological Administration (KMA), to supplant certain components of the current NWP model. In Section 3.1, we selected four models for application, based on their performance: the proposed CH-UNet with transposed convolution, which showed the best accuracy in terms of RMSE, the U-Net with trilinear interpolation, which had the second-best accuracy, and the Half-UNet architecture with both trilinear interpolation and transposed convolution, which had the fewest FLOPs. These four models were applied to the UM model of Low GloSea6, as shown in Figure 9, and their performance was compared and evaluated against the original NWP method. To integrate the models trained in PyTorch into the Fortran-based Low GloSea6, we used the FTorch library [37]. Additionally, we configured the computational environment using Rose and Cylc, as shown in Table 10, setting up server nodes for computational tasks and client nodes for task requests and options, ensuring that the execution was performed in the same manner as the original setup.

3.2.2. Experimental Results

We evaluated the medium-to-long-term accuracy of the proposed CH-UNet model and other deep network models’ predictions for timesteps 2 to 10, using the BiCGStab results of the existing NWP model as the ground truth. This demonstrates that models trained on data from only the first timestep can be generally applicable across multiple timesteps. Therefore, we compared the RMSE for each timestep, as shown in Figure 10. In all timesteps, the CH-UNet architecture with transposed convolution upsampling showed the highest accuracy, with an average of 10.24% lower RMSE compared to the Half-UNet with trilinear interpolation upsampling, which had the lowest FLOPs. However, while the U-Net architecture showed the second-lowest RMSE at timestep 2, its RMSE increased afterward, showing the highest RMSE from timestep 3 onward. This indicates that the proposed CH-UNet not only performs well for shorter timesteps but also maintains stable and superior accuracy over longer timesteps.
Next, we compared the total execution time of the “um-atmos.exe” atmospheric model executable in the “UM_MODEL” process, as shown in Figure 1, on the server node specified in Table 10, which performs the computational tasks of Low GloSea6. Excluding timestep 1, which was used for training the deep network models, we measured the execution time for each timestep from 2 to 10. Figure 11 shows the execution times for each model across timesteps.
When comparing the execution time of the “um-atmos.exe” atmospheric model executable in Low GloSea6, the proposed CH-UNet with transposed convolution upsampling demonstrated faster execution times than all other deep networks over the entire period. It also showed an average of 6.8% reduced execution time compared to the Half-UNet architecture with trilinear interpolation upsampling, which had the lowest FLOPs. This indicates that the proposed CH-UNet, with slightly higher computational cost but superior accuracy, performs better in terms of execution time when integrated into the NWP model to meet physical constraints compared to deep networks with lower FLOPs. When comparing execution times with the existing NWP model, the model integrated with the proposed CH-UNet reduced execution time by up to 78 s, or an average of 2.6%, during the medium-range period from timesteps 3 to 10. In contrast, other deep network models showed slower execution times than the NWP method across all timesteps. By integrating a deep network with excellent accuracy and reasonably low computational cost into the existing NWP model, the computational load for solving the Helmholtz problem, used to calculate pressure increments, was reduced. This, in turn, decreased the number of calls to the tri_sor module, which had been identified as a hotspot in the profiling results. As a result, the integrated deep network model achieved shorter execution times compared to the NWP model, providing meteorologists with a solution that integrates deep learning without increasing interpretive difficulties.
When synthesizing the results of Figure 10 and Figure 11, as shown in Figure 9, the BiCGStab method still had to be performed after the deep network model to meet the physical constraints of the NWP model. As a result, the proposed CH-UNet architecture, which has appropriately low FLOPs and superior accuracy, reduced the convergence time of the subsequent BiCGStab method, leading to shorter overall execution times compared to the Half-UNet with trilinear interpolation upsampling, which had the lowest FLOPs, as shown in Figure 11.

4. Conclusions

In this study, we proposed an integration framework between deep network models and the traditional NWP model, Low GloSea6, to reduce execution time. During the profiling of Low GloSea6, the tri_sor.F90 module was identified as the hotspot consuming the highest CPU time. To address this, we integrated deep networks with the BiCGStab method while ensuring that the physical interpretability of the meteorological model was not compromised. Additionally, we introduced a lightweight 3D U-Net architecture, CH-UNet. The CH-UNet architecture demonstrated a 10.24% reduction in RMSE compared to the existing Half-UNet and consistently achieved superior accuracy in medium- to long-term predictions compared to other models. Moreover, integrating CH-UNet into Low GloSea6 reduced execution time by up to 78 s, with an average reduction of 2.6% per timestep compared to the original numerically computed Low GloSea6.
This proposed method demonstrates that high-performance computational tasks, such as ensemble model execution, can be achieved without increasing physical interpretation difficulties in small- to medium-sized servers at university research facilities or institutions with limited resources. Moreover, since the proposed framework does not alter the existing structure, it can also be applied to the high-resolution version, GloSea6. Beyond the meteorology and climate modeling fields, this approach can also be applied to other fields that rely on numerical computation, allowing for reduced execution times within limited resources.
We propose two directions for future research. First, in this study, the deep network was trained using data from only a single timestep. By collecting data from a broader range of timesteps and using multiple timesteps for training, the model’s predictions could become more generalized and accurate, leading to further reductions in execution time. Second, in the method proposed in this study, the BiCGStab process, a numerical computation method, is still performed after the deep network to meet physical constraints. If it becomes possible to satisfy these physical constraints using only the deep network process, a significant reduction in execution time could be achieved. To this end, efforts such as relaxing the physical constraints in the atmospheric model based on the deep network’s output could allow the Helmholtz computation to be fully handled by the deep network, further reducing model execution time.

Author Contributions

Conceptualization, H.P.; methodology, H.P.; software, H.P.; validation, H.P.; formal analysis, H.P.; investigation, H.P.; resources, H.P.; data curation, H.P.; writing—original draft preparation, H.P.; writing—review and editing, S.C.; visualization, H.P.; supervision, S.C.; project administration, S.C.; funding acquisition, S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (RS-2024-00463316).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors extend their appreciation to the National Research Foundation of Korea (NRF), funded by the Ministry of Education, for supporting this research work through the Basic Science Research Program (Project No. RS-2024-00463316).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Schultz, M.G.; Betancourt, C.; Gong, B.; Kleinert, F.; Langguth, M.; Leufen, L.H.; Mozaffari, A.; Stadtler, S. Can deep learning beat numerical weather prediction? Philos. Trans. R. Soc. A 2021, 379, 20200097. [Google Scholar] [CrossRef] [PubMed]
  2. Kwok, P.H.; Qi, Q. A Variational U-Net for Weather Forecasting. arXiv 2021, arXiv:2111.03476. [Google Scholar]
  3. Chen, L.; Du, F.; Hu, Y.; Wang, Z.; Wang, F. SwinRDM: Integrate SwinRNN with Diffusion Model towards High-Resolution and High-Quality Weather Forecasting. Proc. AAAI Conf. Artif. Intell. 2023, 37, 322–330. [Google Scholar] [CrossRef]
  4. Frnda, J.; Durica, M.; Rozhon, J.; Vojtekova, M.; Nedoma, J.; Martinek, R. ECMWF short-term prediction accuracy improvement by deep learning. Sci. Rep. 2022, 12, 7898. [Google Scholar] [CrossRef]
  5. Cho, D.; Yoo, C.; Son, B.; Im, J.; Yoon, D.; Cha, D. A novel ensemble learning for post-processing of NWP Model’s next-day maximum air temperature forecast in summer using deep learning and statistical approaches. Weather Clim. Extrem. 2022, 35, 100410. [Google Scholar] [CrossRef]
  6. Yao, Y.; Zhong, X.; Zheng, Y.; Wang, Z. A Physics-Incorporated Deep Learning Framework for Parameterization of Atmospheric Radiative Transfer. J. Adv. Model. Earth Syst. 2023, 15, e2022MS003445. [Google Scholar] [CrossRef]
  7. Mu, B.; Chen, L.; Yuan, S.; Qin, B. A radiative transfer deep learning model coupled into WRF with a generic fortran torch adaptor. Front. Earth Sci. 2023, 11, 1149566. [Google Scholar] [CrossRef]
  8. Zhong, X.; Ma, Z.; Yao, Y.; Xu, L.; Wu, Y.; Wang, Z. WRF–ML v1.0: A bridge between WRF v4.3 and machine learning parameterizations and its application to atmospheric radiative transfer. Geosci. Model Dev. 2023, 16, 199–209. [Google Scholar] [CrossRef]
  9. Chen, G.; Wang, W.-C.; Yang, S.; Wang, Y.; Zhang, F.; Wu, K. A neural network-based scale-adaptive cloud-fraction scheme for GCMs. J. Adv. Model. Earth Syst. 2023, 15, e2022MS003415. [Google Scholar] [CrossRef]
  10. Zhong, X.; Yu, X.; Li, H. Machine learning parameterization of the multi-scale Kain–Fritsch (MSKF) convection scheme and stable simulation coupled in the Weather Research and Forecasting (WRF) model using WRF–ML v1.0. Geosci. Model Dev. 2024, 17, 3667–3685. [Google Scholar] [CrossRef]
  11. Mu, B.; Zhao, Z.-J.; Yuan, S.-J.; Qin, B.; Dai, G.-K.; Zhou, G.-B. Developing intelligent Earth System Models: An AI framework for replacing sub-modules based on incremental learning and its application. Atmos. Res. 2024, 302, 107306. [Google Scholar] [CrossRef]
  12. Wang, X.; Han, Y.; Xue, W.; Yang, G.; Zhang, G.J. Stable climate simulations using a realistic general circulation model with neural network parameterizations for atmospheric moist physics and radiation processes. Geosci. Model Dev. 2022, 15, 3923–3940. [Google Scholar] [CrossRef]
  13. Choi, S.; Jung, E.S. Optimizing Numerical Weather Prediction Model Performance Using Machine Learning Techniques. IEEE Access 2023, 11, 86038–86055. [Google Scholar] [CrossRef]
  14. Walters, D.; Baran, A.J.; Boutle, I.; Brooks, M.; Earnshaw, P.; Edwards, J.; Furtado, K.; Hill, P.; Lock, A.; Manners, J.; et al. The Met Office Unified Model Global Atmosphere 7.0/7.1 and JULES Global Land 7.0 configurations. Geosci. Model Dev. 2019, 12, 1909–1963. [Google Scholar] [CrossRef]
  15. Intel VTune Profiler. Available online: https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler.html (accessed on 24 June 2024).
  16. ROSE. Available online: https://metomi.github.io/rose/2019.01.8/html/tutorial/rose/index.html (accessed on 8 January 2019).
  17. CYLC Introduction. Available online: https://cylc.github.io/cylc-doc/latest/html/tutorial/introduction.html (accessed on 25 June 2024).
  18. Jinja Introduction. Available online: https://jinja.palletsprojects.com/en/3.0.x/intro/ (accessed on 9 November 2021).
  19. Tee, G.J. Eigenvectors of the Successive Over-Relaxation Process, and its Combination with Chebyshev Semi-Iteration. Comput. J. 1963, 6, 250–263. [Google Scholar] [CrossRef]
  20. Mittal, S. A study of successive over-relaxation method parallelisation over modern HPC languages. Int. J. High Perform. Comput. Netw. 2014, 7, 292–298. [Google Scholar] [CrossRef]
  21. Allahviranloo, T. Successive over relaxation iterative method for fuzzy system of linear equations. Appl. Math. Comput. 2005, 162, 189–196. [Google Scholar] [CrossRef]
  22. van der Vorst, H.A. Bi-CGSTAB: A Fast and Smoothly Converging Variant of Bi-CG for the Solution of Nonsymmetric Linear Systems. SIAM J. Sci. Stat. Comput. 1992, 13, 631–644. [Google Scholar] [CrossRef]
  23. Wang, M.; Sheu, T. An element-by-element BICGSTAB iterative method for three-dimensional steady Navier-Stokes equations. J. Comput. Appl. Math. 1997, 79, 147–165. [Google Scholar] [CrossRef]
  24. Long, C.; Liu, S.; Sun, R.; Lu, J. Impact of structural characteristics on thermal conductivity of foam structures revealed with machine learning. Comput. Mater. Sci. 2024, 237, 112898. [Google Scholar] [CrossRef]
  25. Havdiak, M.; Aliaga, J.I.; Iakymchuk, R. Robustness and Accuracy in Pipelined Bi-Conjugate Gradient Stabilized Method: A Comparative Study. arXiv 2024, arXiv:2404.13216. [Google Scholar]
  26. Joly, P.; Meurant, G. Complex conjugate gradient methods. Numer. Algorithms 1993, 4, 379–406. [Google Scholar] [CrossRef]
  27. Wang, H.; Liu, F.; Xia, L.; Crozier, S. An efficient impedance method for induced field evaluation based on a stabilized Bi-conjugate gradient algorithm. Phys. Med. Biol. 2008, 53, 6363. [Google Scholar] [CrossRef] [PubMed]
  28. Brownlee, J. How to choose a feature selection method for machine learning. Mach. Learn. Mastery 2019, 10, 1–7. [Google Scholar]
  29. Khairoutdinov, M.F.; Blossey, P.N.; Bretherton, C.S. Global system for atmospheric modeling: Model description and preliminary results. J. Adv. Model. Earth Syst. 2022, 14, e2021MS002968. [Google Scholar] [CrossRef]
  30. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  31. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2014, arXiv:1411.4038. [Google Scholar]
  32. Kaparakis, C.; Mehrkanoon, S. WF-UNet: Weather Fusion UNet for Precipitation Nowcasting. arXiv 2023, arXiv:2302.04102. [Google Scholar]
  33. Kim, T.; Kang, S.; Shin, H.; Yoon, D.; Eom, S.; Shin, K.; Yun, S.Y. Region-conditioned orthogonal 3D U-Net for weather4cast competition. arXiv 2022, arXiv:2212.02059. [Google Scholar]
  34. Fernandez, J.G.; Mehrkanoon, S. Broad-UNet: Multi-scale feature learning for nowcasting tasks. Neural Netw. 2021, 144, 419–427. [Google Scholar] [CrossRef]
  35. Lu, H.; She, Y.; Tie, J.; Xu, S. Half-UNet: A simplified U-Net architecture for medical image segmentation. Front. Neuroinform. 2022, 16, 911679. [Google Scholar] [CrossRef] [PubMed]
  36. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  37. FTorch Documentation. Available online: https://cambridge-iccs.github.io/FTorch/ (accessed on 23 July 2024).
Figure 1. The overall execution process of GloSea6.
Figure 2. Operational structure of “um-atmos.exe”.
Figure 3. Correlation heatmap of variables used in BiCGStab.
Figure 4. Resolution size of the 3D grid data. (a) matches the latitude and longitude grid size of the Low GloSea6 UM model, and (b) is adjusted to be a multiple of 2 to facilitate the upsampling process in the U-Net architecture.
Figure 5. U-Net architecture [30].
Figure 6. Half-UNet architecture [35].
Figure 7. CBAM-based Half-UNet (CH-UNet) architecture.
Figure 8. Overall structure of CBAM and Sub-Attention Modules [36].
Figure 9. Hybrid-DL NWP model structure integrating CH-UNet in the UM model of Low GloSea6.
Figure 10. Comparison of “um-atmos.exe” file execution time for each timestep.
Figure 11. Comparison of RMSE for each deep network model’s prediction results during Low GloSea6 execution by timestep.
Table 1. Model version and resolution of GloSea6.

Model          Version        Grid Size              Resolution
Atmosphere     UM vn11.5      60 km/0.83° × 0.25°    N216L85
Land Surface   JULES vn5.6    60 km/0.83° × 0.25°    N216L4
Ocean          NEMO vn3.6     25 km/0.25° × 0.25°    eORCA025L75
Sea-Ice        CICE vn5.1.2   25 km/0.25° × 0.25°    eORCA025L75
Table 2. The resolution of GloSea6 and Low GloSea6.

Model          Component   Resolution   Grid Size   Grid Degree
GloSea6        UM          N126         60 km       0.83° × 0.25°
GloSea6        NEMO        eORCA025     25 km       ∼0.25°
Low GloSea6    UM          N96          170 km      1.88° × 1.25°
Low GloSea6    NEMO        eORCA1       100 km      ∼1.0°
Table 3. Hardware specifications used for profiling.

Name   Hardware Specification
CPU    Intel® Core 10th Gen i7-10700K
RAM    64 GB
SSD    1 TB
GPU    GeForce RTX 3080
Table 4. Profiling results of Low GloSea6 using Intel VTune Profiler.

Item                      Value
CPU Time                  29,877.390 s
Effective Time            13,390.712 s
Spin Time                 16,486.679 s
Overhead Time             0 s
Instructions Retired      206,821,904,600,000
Microarchitecture Usage   57.4%
Total Thread Count        17
Paused Time               3.639 s
Table 5. CPU time profiling by module for Low GloSea6.

CPU Time (s)   Microarchitecture Usage (%)   Module                 Function (Full)
13,032.006     79.8                          libmpi.so.12.0.5       MPIDI_CH3I_Progress
3097.394       36.6                          libmpi.so.12.0.5       MPID_nem_tcp_connpoll
1408.185       34.6                          [Unknown]              [Outside any known module]
1283.903       100.0                         libmpi.so.12.0.5       MPIDU_Sched_are_pending
1125.987       13.3                          um-atmos.exe           _tri_sor_mod_MOD
888.238        48.4                          libm-2.17.so           __ieee754_pow_sse2
617.219        63.0                          libm-2.17.so           __ieee754_exp_avx
450.796        41.3                          libm-2.17.so           __exp1
437.765        23.0                          libgfortran.so.4.0.0   func@0x1c270
347.046        22.4                          um-atmos.exe           __mod_cosp_MOD_cosp_iter
Table 6. Description of the selected variables.

Variable   Description
x_1        The matrix x updated in the 1st iteration
b_1        The matrix corresponding to the right-hand side term in the linear system during the 1st iteration
p_1        The direction matrix used in the conjugate gradient method during the 1st iteration
x_2        The matrix x updated in the 2nd iteration
b_2        The matrix corresponding to the right-hand side term in the linear system during the 2nd iteration
p_2        The direction matrix used in the conjugate gradient method during the 2nd iteration
x          The matrix x after the convergence of the solution
Table 7. Average CPU time for BiCGStab in each loop.

Loop Name           CPU Time (s)
Outer 1, Inner 1    1.9284
Outer 1, Inner 2    0.9091
Outer 2, Inner 1    0.5974
Outer 2, Inner 2    0.1548
Table 8. Hardware specifications used for model training.

Name   Hardware Specification
CPU    Intel(R) Xeon(R) Gold 6246R CPU @ 3.40 GHz
RAM    256 GB
SSD    1 TB
GPU    NVIDIA RTX A6000 D6 48 GB × 4 pcs
Table 9. Performance comparison between the proposed CH-UNet and U-Net-based architectures.

Up Sampling               Architecture   Params   FLOPs    RMSE     MAE      SMAPE     WAPE      R2 Score
Transposed Convolution    U-Net          58.9k    3.55G    0.0027   0.0013   0.2192%   0.2191%   0.8100
                          U-Net3+        85.9k    12.07G   0.0032   0.0017   0.2921%   0.2920%   0.7434
                          Half-UNet      13.2k    2.98G    0.0027   0.0014   0.2347%   0.2347%   0.8099
                          CH-UNet        15.7k    3.71G    0.0024   0.0010   0.1814%   0.1814%   0.8576
Trilinear Interpolation   U-Net          71.1k    4.63G    0.0026   0.0013   0.2211%   0.2210%   0.8296
                          U-Net3+        85k      10.53G   0.0028   0.0016   0.2816%   0.2815%   0.7985
                          Half-UNet      12k      2.64G    0.0030   0.0015   0.2600%   0.2599%   0.7661
                          CH-UNet        14.5k    3.37G    0.0040   0.0021   0.3713%   0.3713%   0.5891
Table 10. Hardware specifications for measuring the execution time of Low GloSea6.

Name   Server Node                       Client Node
CPU    Intel® Core 13th Gen i9-13900F    Intel® Core 7th Gen i7-7700
RAM    126 GB                            8 GB
SSD    32 TB                             1 TB
GPU    GeForce RTX 4070 Ti               No GPU
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
