Evaluation of Machine Learning Assisted Phase Behavior Modelling of Surfactant–Oil–Water Systems

Daulet Magzymov ^1,2,*, Meruyert Makhatova ^1,3, Zhassulan Dairov ^1 and Murat Syzdykov

^1 Oil and Gas Department, Atyrau Oil and Gas University, Atyrau 060027, Kazakhstan

^2 Department of Petroleum Engineering, University of Houston, Houston, TX 77023, USA

^3 Petroleum Engineering Department, Colorado School of Mines, Golden, CO 80401, USA

^* Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(1), 100; https://doi.org/10.3390/app15010100

Submission received: 30 October 2024 / Revised: 8 December 2024 / Accepted: 12 December 2024 / Published: 26 December 2024

Figure 1. Illustration of input parameter range and model output types for machine learning algorithms.

Figure 2. Illustration of graphical equation-of-state model. Blue cross overall composition is in single-phase region, purple cross is in two-phase equilibrium of microemulsion A and oil, and green cross is in three-phase region where equilibrium is defined by B and C tie-lines.

Figure 3. Graphical equation-of-state model workflow.

Figure 4. Physical model performance compared to experimental data, salinity scan. Data from Zhang et al. (2015) [41].

Figure 5. Reference phase behavior based on physical model, to be used for machine learning comparison.

Figure 6. Machine learning phase behavior predictions (accuracy): fine tree (97.4%), medium tree (93.4%), linear SVM (90.6%), cubic SVM (99.5%), KNN (98.6%), boosted trees (96.9%).

Figure 7. Phase behavior predicted with machine learning and graphical equation-of-state, ternary diagram at varying salinities: (a) 0.85%wt, (b) 1.15%wt, (c) 1.5%wt.

Figure 8. Fixed water–oil ratio compositional space, fish plot. HLD = ln(S/Sopt), Sopt = 1.15%wt.

Figure 9. Machine learning model extrapolation: (a) near extrapolation at 3.5%wt (3%wt is the upper range for tuned data); (b) far extrapolation at 7%wt (3%wt is the upper range for tuned data).

Figure 10. Illustration of less accuracy with sparse data available for SVM training.

Figure 11. Graphical equation-of-state model performance compared to experimental data and physical model, salinity scan. Data from Zhang et al. (2015) [41].

Abstract

This paper evaluates the ability of machine learning (ML) algorithms to capture and reproduce complex multiphase behavior in surfactant–oil–water systems, using a controlled environment of known data generated via physical models. We evaluated several machine learning algorithms including decision trees, support vector machines (SVMs), k-nearest neighbors, and boosted trees. Moreover, the study integrates a novel graphical equation-of-state model with ML-generated compositional spaces to test ML’s effectiveness in predicting phase transitions and compares its performance to experimental data and a validated physical model. Our results demonstrate that the cubic SVM has the highest accuracy in capturing key behaviors, such as the shrinking of two-phase regions as salinity deviates from optimal conditions, and performs well even in near-extrapolated scenarios. Additionally, the graphical equation-of-state model aligns closely with both experimental data and the physical model, providing a robust framework for analyzing multiphase behavior. We do not suggest that machine learning models should replace traditional physical models; rather, they should complement physical models by extending predictive capabilities, especially when experimental data are limited. This hybrid approach offers a promising method for investigating complex multiphase phenomena in surfactant systems.

Keywords:

surfactant; phase behavior; hybrid model; machine learning; graphical equation of state

1. Introduction

Despite extraction efforts during the primary and secondary recovery phases, substantial volumes of crude oil often remain within reservoirs. Enhanced oil recovery (EOR) methods are implemented to increase the efficiency of oil extraction. Among these methods, chemical flooding has gained prominence as an effective strategy for mobilizing residual oil and improving overall recovery factor [1,2,3,4] through four primary mechanisms: reducing interfacial tension, altering wettability, foam generation, and emulsification [5,6,7,8].

Surfactants are complex organic compounds characterized by two distinct features: a water-attracting (hydrophilic) head and a water-repelling (hydrophobic) tail. This duality allows surfactants to orient themselves at oil–water interfaces forming micelles at a certain concentration [9,10], known as the critical micelle concentration (CMC). A key parameter indicating the surfactant’s solubilization in either oil or water and ability to form microemulsions is the hydrophile–lipophile balance (HLB) [11].

Winsor (1948) [12] first classified microemulsions into four types, denoted here as Type II− (oil dispersed in a water-continuous microemulsion, also referred to as Winsor Type I), Type II+ (water dispersed in a continuous oil phase, also referred to as Winsor Type II), Type III (middle-phase microemulsion), and Type IV (single-phase microemulsion). Winsor’s ratio, introduced in 1954 [13], was used as a tool for analyzing the phase behavior in surfactant–oil–water systems; it is defined as the ratio of the interaction of the surfactant at the interface with the surrounding oil molecules to its interaction with the surrounding water molecules. Because microemulsion phase behavior is a function of several features, including oil composition, surfactant structure, cosolvent, salinity, pressure, and temperature [14,15,16], the HLB and Winsor’s ratio approaches have limitations in surfactant formulation optimization [17,18], which instead requires understanding and predicting complex microemulsion phase behavior.

The quantitative prediction of microemulsion phase behavior is inherently complex and has been explored through various approaches, including empirical fitting models, lattice-based statistical thermodynamics, and free energy interaction frameworks. The earliest attempt to describe solute partitioning between two phases was proposed by Hand (1930) [19]. This model was later modified and widely adopted for numerical simulations of enhanced oil recovery processes [2]. However, while these models effectively describe two-phase zones, predictions at the three-phase invariant point remain heavily dependent on empirical data, thus limiting their accuracy.

Salager et al. (1979) [20] introduced a correlation to optimize three-phase systems for oil recovery, considering variables such as salinity, temperature, oil composition, alcohol type and concentration, and surfactant type. This correlation was later defined as the hydrophilic–lipophilic difference (HLD) concept, which quantifies the chemical potential difference between oil and water to characterize surfactant affinity [21]. The HLD framework has been instrumental in quantifying interfacial tension (IFT) minimization [22]. Nevertheless, the application of the HLD model in phase behavior predictions is constrained by the limited degrees of freedom in its formulation.

Acosta (2003) [23] expanded on the HLD concept by integrating it with the net-average curvature (NAC) model, creating an equation of state that incorporates the HLD database to predict microemulsion phase behavior. Despite its utility, this model relies on two assumptions: that the characteristic length is constant, and that the optimum solubilization ratio is insensitive to temperature and pressure. Addressing these limitations, Ghosh and Johns (2016) [24] introduced a novel predictive phase-behavior model that incorporates a pressure-dependent factor into the HLD equation. By integrating this enhanced HLD equation with the NAC model, they successfully predicted critical parameters, including phase volumes, solubilization ratios, and transitions between microemulsion phases, with validation against experimental data. However, challenges remain in addressing non-physical two-phase regions in the model.

Khorsandi and Johns (2016) [25] presented the first flash calculation algorithm, which is non-iterative and robust for three- and two-phase zones, based on the hydrophilic–lipophilic difference net-average curvature (HLD-NAC) framework, capable of modeling all Winsor regions. This algorithm introduced new correlations for the solubilization ratio at optimum formulation, allowing for flexibility with variations in any formulation parameter(s). However, the characteristic length is assumed to remain constant across the three-phase region. Magzymov and Johns (2022) [26] resolved this limitation by introducing a variable characteristic length in their improved microemulsion flash calculation method. The modification ensures thermodynamic consistency and better replicates phase behavior, particularly in the three-phase and adjacent two-phase regions. However, the improved method requires additional equation-of-state parameters, including coefficients for the HLD equation, characteristic length correlations, and critical tie-line parameterization coefficients, all of which must be calibrated against experimental data.

In recent studies, phase behavior calculations have been accelerated by applying ML algorithms [27,28,29], further integrated with fundamental physics to accurately address the complexities of multiphase systems while avoiding non-physical behavior [30,31,32,33]. Magzymov et al. (2024) [34] introduced a hybrid ML model that combines physics-based flash calculations with ML-predicted equilibrium constants for CO₂–oil systems, achieving high accuracy in compositional space predictions.

Several recent studies have focused on employing ML to accelerate microemulsion phase behavior predictions and to estimate surfactant properties based on models or experimental data. Furth et al. (2024) [35] proposed an ML-based workflow for predicting the equivalent alkane carbon number (EACN), a critical parameter in the HLD framework that describes oil phase hydrophobicity but typically requires labor-intensive experimental determination. Similarly, Bell (2016) [36] utilized decision tree methods to predict missing data points in the temperature–composition plane for non-ionic surfactant partitioning. Thacker et al. (2023) [37] explored the use of ML to generate surfactant phase diagrams based on known phase behavior. While these studies provided valuable insights, their effectiveness was constrained by limited datasets, resulting in insufficient chemical space coverage. Despite these limitations, the comparative evaluation of different ML techniques identified support vector machines (SVMs) and neural networks as the most effective for such applications.

The prediction of surfactant system properties, such as interfacial tension (IFT), presents another empirically challenging area that has been addressed by Seddon et al. (2022) [38] and Rashidi-Khaniabadi et al. (2023) [39]. Their studies demonstrated the capability of ML algorithms to analyze surfactant sample datasets and achieve accurate predictions, with R² values ranging from 0.69 to 0.87. By circumventing the need for empirical tuning across diverse surfactant types, these approaches highlight the utility of ML; however, the relatively small datasets (e.g., 390 experimental IFT data points) increase the risk of overfitting, underscoring the need for dataset expansion to improve generalizability.

Talapatra et al. (2024) [40] extended this application by introducing ML regression models to estimate microemulsion viscosity. These models were trained on computational datasets derived from molecular dynamics (MD) simulations, incorporating variables such as pressure, temperature, brine salinity, and surfactant concentration as inputs. Although the model-based generation of training data could have enabled a more comprehensive dataset, the study used only 462 points, and the potential for overfitting was not addressed.

The objective of this paper is to assess whether highly non-linear and discretized multiphase behavior in surfactant–oil–water systems can be accurately captured, reproduced, and even reasonably near extrapolated using ML approaches. By focusing on complex phase transitions, including single-phase, two-phase, and three-phase regions, we aim to test whether ML models could effectively represent these intricate behaviors. We use data generated by physical models to ensure an abundance of data that is free from experimental uncertainties for fair comparison and testing of machine learning algorithms. First, we generate data for training and testing of machine learning models. Second, we introduce a graphical equation-of-state model that uses an ML-generated phase diagram to perform phase behavior modelling as a hybrid approach. Third, we present the results of this hybrid approach that combines machine learning and graphical equation of state. By integrating a validated physical model with ML-based predictive frameworks, the study explores ML’s capability to capture phase transitions. The findings, validated against experimental data and physical models, highlight the potential of combining ML with traditional approaches to improve the understanding and prediction of intricate phase behavior under diverse conditions. Lastly, we highlight the main conclusions of the study, its implications, and future research directions.

2. Methodology and Materials

In this section, we describe the tuning of the physical model to generate inputs for machine learning algorithms, followed by a description of the graphical equation-of-state model.

2.1. Data Collection and Physics-Based Model Tuning

We leveraged experimental data from the literature to fine-tune the modified hydrophilic–lipophilic difference net-average curvature (HLD-NAC) model, a physics-based approach for predicting surfactant–oil–water phase behavior [15,19,28]. The experimental data from Zhang et al. (2015) [41] include salinity scans and solubilization ratios of water and oil in the microemulsion phase. These data were used to calibrate the model so that it accurately reflects the measured phase behavior of the given system.

The modified HLD-NAC model incorporates key system parameters, such as surfactant structure, oil type, salinity, temperature, and pressure, allowing us to account for the intricate balance between hydrophilic and lipophilic forces. By optimizing the model parameters, we ensured that the model provides robust predictions for the phase behavior of surfactant–brine–oil systems under varying conditions [26]. We note here that the models assume pure excess phases, i.e., the excess oil phase is pure oil and the excess aqueous phase is the pure water pseudo-component.

2.2. Data Generation for Machine Learning

After fine-tuning the modified HLD-NAC model, we used it to generate a comprehensive dataset consisting of 5000 data points. These data points included variations in system parameters, such as surfactant concentration, oil composition, and salinity, representing a wide range of possible conditions in surfactant–oil–water systems. Each data point included corresponding phase behavior outcomes, including the number of phases and compositional details for each phase, see Figure 1. This dataset ensured a balanced representation of the compositional space across all phases, from single-phase regions to complex three-phase microemulsions.
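
The data-generation step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in the study: the function names, the uniform sampling on the ternary diagram, the lower salinity bound `s_min`, and the random seed are all hypothetical. Only the 5000-point count and the 3 wt% upper salinity limit (Figure 9 caption) come from the text. The `label_fn` callable stands in for the tuned modified HLD-NAC model, which is not reproduced here.

```python
import random

def sample_composition(rng):
    """Draw (water, oil, surfactant) fractions uniformly on the ternary diagram."""
    # The spacings between two sorted uniform numbers on [0, 1] are
    # uniformly distributed over the simplex and always sum to 1.
    a, b = sorted((rng.random(), rng.random()))
    return (a, b - a, 1.0 - b)

def generate_dataset(n_points=5000, s_min=0.1, s_max=3.0, label_fn=None, seed=0):
    """Return rows of (water, oil, surfactant, salinity_wt_pct, label).

    s_min is a hypothetical lower salinity bound; s_max = 3.0 wt% is the
    upper limit of the tuned data quoted in the paper.
    label_fn(water, oil, surf, salinity) stands in for the physical model;
    when it is None, the label is left empty.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_points):
        water, oil, surf = sample_composition(rng)
        salinity = rng.uniform(s_min, s_max)
        label = label_fn(water, oil, surf, salinity) if label_fn else None
        rows.append((water, oil, surf, salinity, label))
    return rows

dataset = generate_dataset()
```

Uniform sampling is only one possible design; a study that cares most about phase boundaries might instead sample more densely near them.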

2.3. Machine Learning Model Development

We applied a range of machine learning algorithms to the generated dataset to identify the model best suited for predicting surfactant phase behavior. The primary goal was to develop a model that can accurately predict critical outputs, including the number of phases, phase compositions, and whether the system exhibited multiphase behavior (up to three phases). The dataset, which consisted of 5000 data points representing various combinations of salinity, oil composition, surfactant concentration, and other factors, served as the input for training and validation.

The machine learning algorithms considered in this study include support vector machine (SVM), decision tree, k-nearest neighbor (KNN), and boosted tree models. We applied both fine and medium decision tree models. Decision trees are well-suited for this task as they can handle non-linear relationships between variables and are interpretable, providing clear insights into the decision-making process for phase prediction. The fine tree model was designed to capture subtle variations in the data by creating a large number of branches, while the medium tree model struck a balance between complexity and generalization to avoid overfitting. SVM models are particularly effective in high-dimensional spaces where clear decision boundaries between phases are required. We applied both linear and cubic SVM models. The linear model served as a baseline, offering simplicity and fast computation, while the cubic model provided greater flexibility in capturing complex relationships within the compositional space, making it better suited for identifying transitions between single-phase, two-phase, and three-phase regions. We also tested k-nearest neighbors and boosted trees.

Additionally, the models were tested on their ability to map the compositional space accurately, ensuring they could correctly identify critical points, such as tie-line boundaries and phase transitions. Ultimately, the model that demonstrated the highest accuracy and consistency in predicting phase behavior was selected for further integration with the graphical equation-of-state model.
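
To make the classification framing concrete, the sketch below trains one of the listed algorithm types, k-nearest neighbors, on a toy compositional map and reports hold-out accuracy. It is an assumption-laden illustration: `toy_phase_count` is a hypothetical labelling rule and not the HLD-NAC model, salinity is omitted for brevity, and the value of k, the sample sizes, and the pure-Python implementation are arbitrary choices that do not reflect the software used in the study.

```python
import math
import random

def toy_phase_count(water, oil, surf):
    """Hypothetical labelling rule (NOT the HLD-NAC model): 1, 2, or 3 phases."""
    if surf >= 0.30:
        return 1
    if surf < 0.10 and min(water, oil) > 0.25:
        return 3
    return 2

def sample_point(rng):
    """Return ((water, oil, surf), label) for a random ternary composition."""
    a, b = sorted((rng.random(), rng.random()))
    comp = (a, b - a, 1.0 - b)
    return comp, toy_phase_count(*comp)

def knn_predict(train, query, k=5):
    """Majority vote among the k training points nearest to the query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

rng = random.Random(1)
train = [sample_point(rng) for _ in range(2000)]
test = [sample_point(rng) for _ in range(300)]
accuracy = sum(knn_predict(train, x) == y for x, y in test) / len(test)
print(f"hold-out accuracy: {accuracy:.3f}")
```

A single accuracy number hides where the errors fall; as the text notes, consistency of the predicted phase regions near boundaries matters as much as the overall score.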

2.4. Graphical Equation-of-State Model

The graphical equation-of-state model is a novel approach introduced in this paper for capturing the number of phases and their compositions in surfactant–oil–water systems. This method operates by marching through the compositional space, such as along tie lines and tie triangle edges, and identifying transitions between multiphase regions. It is applicable to both machine learning-generated compositional mappings and physical models.

2.5. Engineered Compositional Space Marching

The graphical method starts by identifying the relationship between the overall composition and the number of phases in the system (see Figure 2 for an illustration of the model). Under the pure excess phases assumption, any composition in a two-phase lobe lies on a tie line that connects to a corner of the ternary diagram. For an overall composition in a two-phase region (purple cross), the model marches along the purple tie line towards point A, in the direction of the blue arrow, moving from the two-phase region toward the single-phase region. The composition at the boundary of the two-phase region determines the equilibrium state for the given overall composition, i.e., composition A is the equilibrium microemulsion composition.

For compositions in the single-phase region, referring to the blue cross overall composition, the overall composition is identical to the equilibrium microemulsion phase composition. In the three-phase region, the method marches from the overall composition (green cross) towards two tie lines at the boundary of the two-phase regions (B and C), maintaining a constant oil-to-surfactant and water-to-surfactant ratio. Marching stops at the boundary between the three-phase and two-phase regions, where the oil-to-surfactant and water-to-surfactant ratios are used to define the shape of the tie triangle. From the tie triangle, microemulsion composition is extracted from an invariant point corresponding to the equilibrium. This engineered approach is depicted in Figure 3, which outlines the workflow of the graphical equation-of-state method for identifying phase compositions and boundaries.
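
The two-phase marching step can be sketched as follows for the case of a microemulsion in equilibrium with a pure excess oil phase. This is a minimal sketch, not the authors' implementation: the function names, the fixed step size, the bisection refinement, and the toy boundary `toy_n_phases` are assumptions, and compositions are treated as volume fractions so that the solubilization ratios are simple ratios of fractions. The three-phase marching described above is not covered.

```python
def march_to_boundary(z, n_phases, step=0.01, tol=1e-9):
    """March from overall composition z = (water, oil, surf) away from the
    oil vertex until n_phases reports a single phase, then refine by bisection.

    Assumes a pure excess oil phase, so the tie line passes through the oil
    vertex (0, 1, 0) and z. Returns the microemulsion composition and the
    fraction of the mixture that is microemulsion (lever rule).
    """
    if n_phases(z) == 1:
        return z, 1.0  # single-phase: overall composition is the microemulsion

    def point(s):
        # z + s * (z - oil_vertex): the water-to-surfactant ratio stays constant
        w, o, sf = z
        return (w * (1 + s), o - (1 - o) * s, sf * (1 + s))

    s_max = z[1] / (1.0 - z[1])  # oil fraction reaches zero at this distance
    lo, hi, s = 0.0, None, step
    while s <= s_max:
        if n_phases(point(s)) == 1:
            hi = s
            break
        lo = s
        s += step
    if hi is None:
        raise ValueError("no single-phase boundary found along this tie line")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if n_phases(point(mid)) == 1:
            hi = mid
        else:
            lo = mid
    return point(hi), 1.0 / (1.0 + hi)

def toy_n_phases(comp):
    """Hypothetical phase map: single phase once surfactant >= oil."""
    return 1 if comp[2] >= comp[1] else 2

micro, micro_fraction = march_to_boundary((0.45, 0.45, 0.10), toy_n_phases)
sigma_o = micro[1] / micro[2]  # oil solubilization ratio
sigma_w = micro[0] / micro[2]  # water solubilization ratio
```

In practice `n_phases` would be the trained classifier or the physical model evaluated at a fixed salinity; the marching logic itself does not depend on which one supplies the phase count.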

2.6. Application of the Graphical Method

The graphical equation-of-state method can be applied to any compositional space mapping, whether generated by a machine learning algorithm or a physics-based model like the modified HLD-NAC. In this study, we used a cubic support vector machine (SVM) model to generate the compositional space, demonstrating that the graphical equation-of-state model matches the results of the modified HLD-NAC model and experimental data in complex salinity scans (Figure 11 in Section 3.5).

Given the relationship between salinity and other state variables—such as temperature, pressure, and composition, as described by the HLD equation—the graphical equation-of-state model can be extended to any compositional space derived from physical models, machine learning algorithms, hybrid approaches, or experimental data. The ability of the graphical method to accurately identify the phase boundaries and equilibrium compositions in complex surfactant–oil–water systems underscores its applicability to a wide range of systems, as long as compositional relationships are known or can be constrained.

3. Results and Discussion

3.1. Matching Experimental Data to the Modified HLD-NAC Model

The first step in validating the modified HLD-NAC model involves matching experimental solubilization ratios of oil and water to the model predictions across a range of salinity values. The experimental and model-predicted solubilization ratios for oil (σ_o) and water (σ_w) are plotted in Figure 4 as functions of salinity (%wt). The experimental data points are marked as circles, while the continuous lines represent the model’s predictions.

As seen in Figure 4, the model captures the key trends in solubilization behavior as salinity increases. Specifically, the modified HLD-NAC model successfully reproduces the characteristic behavior observed in the experimental data, including the transition between Winsor Type I, Type III, and Type II+ systems. The agreement between the experimental data and model predictions demonstrates the robustness of the model in describing phase behavior under varying salinity conditions.

With this validated physical model, we proceed to generate 5000 data points in the compositional space, varying surfactant, oil, and salinity concentrations. These data points form the basis for training and testing multiple machine learning algorithms, which are discussed in the subsequent sections.

3.2. Comparison of Machine Learning Models with Physical Model

After generating 5000 data points from the modified HLD-NAC model, we trained several machine learning algorithms to predict surfactant–oil–water phase behavior. Reference phase behavior generated based on the physical modified HLD-NAC model is shown in Figure 5. The cubic SVM emerged as the best-performing model, with its predictions shown in Figure 6.

The cubic SVM consistently identified phase regions across various surfactant and oil compositions, particularly in the multiphase areas (two-phase in blue, three-phase in purple). While there were minor errors near phase transition boundaries, the model demonstrated strong performance overall, especially in capturing phase transitions.

The other models showed limitations (see Figure 6). The medium tree model was too discretized, producing coarse predictions that missed fine details in phase boundaries. The linear SVM model struggled to capture the curvature of the two-phase region, leading to significant inaccuracies in multiphase areas. The fine tree model performed reasonably well but failed to classify phase regions correctly beyond the two-phase boundary, in particular underpredicting the single-phase region. The boosted trees model also misclassified the single-phase region in several places. The k-nearest neighbors algorithm performed well, with accuracy second only to the cubic SVM.

Overall, the cubic SVM provided the most accurate predictions. The consistency between the SVM model’s predictions and the physical model highlights machine learning’s potential for fast, accurate phase behavior prediction. This approach offers an efficient complement to traditional physical modeling and may reduce the need for extensive computational and experimental resources.

3.3. Prediction Under Varying Salinities Using the SVM Model

After validating the cubic SVM model, we evaluated its performance under varying salinity conditions: under-optimum, optimum, and over-optimum. This allowed us to test its ability to interpolate and predict phase transitions. The results, shown in Figure 7, demonstrate the model’s ability to capture key physical behaviors.

For under-optimum salinity, the SVM accurately identified the shrinking of the Type II+ region and the consistent growth of the Type II− region. Similarly, for over-optimum conditions, the SVM model accurately predicted the shrinking Type II− region and the growing Type II+ two-phase region. For optimal conditions, the model correctly predicted middle-phase microemulsion formation and equilibrium states, aligning closely with the physical model. It is worth noting that even though the reduction in the size of the two-phase region was small, the model could still identify transitions from three-phase to two-phase to single-phase regions in compositional space.
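
The three salinity cases can be related to the salinity-only HLD definition quoted in the Figure 8 caption, HLD = ln(S/Sopt) with Sopt = 1.15 wt%. The short sketch below evaluates it at the three salinities of Figure 7; the function name is an assumption, and the other terms of the general HLD equation (for example temperature and oil type) are deliberately left out.

```python
import math

S_OPT = 1.15  # optimum salinity in wt%, from the Figure 8 caption

def hld(salinity_wt_pct, s_opt=S_OPT):
    """Salinity-only HLD = ln(S / S_opt)."""
    return math.log(salinity_wt_pct / s_opt)

# Salinities of Figure 7: under-optimum, optimum, over-optimum
for s in (0.85, 1.15, 1.5):
    print(f"S = {s:.2f} wt%  ->  HLD = {hld(s):+.3f}")
```

The sign of the result distinguishes the cases: negative below the optimum salinity, zero at the optimum, and positive above it.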

Figure 8 illustrates the cubic SVM model’s ability to predict phase behavior across varying salinities (y-axis) and surfactant concentrations (x-axis) at a fixed water–oil ratio (WOR = 1). The model captured the expected transitions between Winsor Type I, II, and III regions, with clear phase boundaries. The accurate phase predictions within an interpolative setup demonstrate the model’s robustness and ability to generalize physical behavior from the training data. The SVM model’s smooth interpolation between phase regions under varying conditions underscores its value for practical applications, such as in enhanced oil recovery (EOR) reservoir simulation modeling, where both salinity and surfactant concentrations fluctuate dynamically.

3.4. Machine Learning Model Predictions Beyond Training Data Range

Figure 9 illustrates the extrapolation capabilities of the cubic SVM model beyond the training dataset’s salinity range. As expected, near-range extrapolation shows reasonable accuracy, with the two-phase region growing as salinity conditions deviate from the optimum. The edge of the defined range is at 3%wt. Figure 9a shows the results for near extrapolation for 3.5%wt, whereas Figure 9b shows far extrapolation for salinity 7%wt. As we move further from the trained conditions range, the model begins to exhibit unphysical behavior, such as the re-emergence of a three-phase region in conditions where such behavior is not expected.

These results highlight the limitations of machine learning models when applied outside their trained parameter space. While the SVM model performs well within the interpolated region, its predictions become unreliable in far-extrapolated conditions. This outcome emphasizes that machine learning models should be applied cautiously and should not be expected to predict accurately beyond the range of the training dataset, in this case at salinities well above the 3%wt upper limit of the training data.
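
One simple way to act on this caution is to check each query against the training range before trusting a prediction. The helper below is a hypothetical sketch: the function name, the lower bound `s_min = 0.1`, and the `near_margin` threshold separating near from far extrapolation are assumptions, since the paper does not define such a threshold. Only the 3 wt% upper limit and the 3.5 wt% and 7 wt% test salinities come from the text.

```python
def extrapolation_status(salinity, s_min, s_max, near_margin=0.5):
    """Classify a query salinity (wt%) relative to the training range.

    near_margin is a hypothetical tolerance (wt%) separating 'near' from
    'far' extrapolation; it is not taken from the paper.
    """
    if s_min <= salinity <= s_max:
        return "interpolation"
    distance = s_min - salinity if salinity < s_min else salinity - s_max
    return "near extrapolation" if distance <= near_margin else "far extrapolation"

# s_min = 0.1 is an assumed lower bound; s_max = 3.0 wt% is from the paper
for s in (1.15, 3.5, 7.0):
    print(s, extrapolation_status(s, s_min=0.1, s_max=3.0))
```

A guard of this kind does not make extrapolated predictions correct; it only flags where the unphysical behavior described above becomes more likely.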

Additionally, as shown in Figure 10, reducing the training dataset size from 5000 to 500 data points significantly impacts model performance. With only 500 data points, multiple regions in the compositional space are misclassified, particularly in multiphase areas. This highlights the importance of a sufficiently large dataset for training robust machine learning models. A smaller dataset limits the model’s ability to capture the full complexity of phase transitions, leading to less reliable predictions, especially when extrapolating beyond the trained conditions.

In summary, while the cubic SVM performs well with a large dataset within its trained range, caution must be taken when reducing data size or extrapolating too far beyond the trained conditions, as performance can degrade significantly in such scenarios.

3.5. Graphical Equation-of-State Model vs Experimental and Physical Model Results

The final step in this analysis involves comparing the graphical equation-of-state model (based on the compositional space generated by the SVM machine learning model) with both experimental data and the physical model (modified HLD-NAC). Figure 11 presents the solubilization ratios of oil (σ_o) and water (σ_w) as functions of salinity, where the experimental data are shown alongside the predictions from the graphical model and the physical model.

The results demonstrate that the graphical equation-of-state model captures the key trends observed in the experimental data, particularly in reproducing the transitions between Winsor Type I, III, and II+ microemulsions. Both the oil and water solubilization ratios show a strong alignment with experimental data across the salinity range. The graphical model’s ability to follow these trends closely underscores its robustness and accuracy in modeling surfactant–oil–water systems.

When compared to the physical model, the graphical equation-of-state model (driven by the compositional space mapped by the SVM ML model) performs exceptionally well. This consistency across different methods demonstrates the power of combining machine learning-generated compositional mapping with traditional physical models. The graphical model reproduces not only the equilibrium solubilization ratios but also the critical transitions between single-phase, two-phase, and three-phase regions.

We note here that although the graphical method is powerful, it heavily relies on the quality and abundance of compositional data. Large sets of experimental compositional data are often difficult to measure and acquire. In situations where only sparse experimental data are available, feeding such data into machine learning models to define the compositional space for the graphical model may limit its effectiveness. As such, we suggest employing a hybrid approach, where experimental data are used in conjunction with physical models to generate a broader and more reliable compositional space of interest. This space can then be used to train machine learning models effectively. The graphical method can subsequently be applied using either ML-generated compositional spaces or physical model-based spaces, offering flexibility in different modeling scenarios.

Overall, we have shown that machine learning algorithms can capture complex multiphase behavior and phase transitions in a controlled environment with synthetic physical model data. However, machine learning algorithms alone cannot capture all physical limitations, such as mass balance across combined single-, two-, and three-phase regions and the thermodynamic constraints of tie lines and tie triangles. Therefore, in this paper, machine learning was combined with a graphical equation-of-state model that incorporates these physical constraints, including mass balance. Thus, physics-based machine learning and hybrid modeling can be viable solutions for complex systems such as multiphase surfactant–oil–water phase behavior.

4. Conclusions

The primary goal of this study was not to demonstrate the superiority or general applicability of machine learning (ML) models in phase behavior modeling, but rather to assess whether highly non-linear and discretized multiphase behavior in surfactant–oil–water systems can be accurately captured, reproduced, and even reasonably near extrapolated using ML approaches. By focusing on complex transitions, including single-phase, two-phase, and three-phase regions, we aimed to test whether ML models can effectively represent these intricate behaviors. The following main conclusions are drawn from this paper:

We tested six machine learning models drawn from four algorithm families, which demonstrated varying accuracy: fine tree (97.4%), medium tree (93.4%), linear SVM (90.6%), cubic SVM (99.5%), KNN (98.6%), boosted trees (96.9%). The performance of the models was assessed not only on the quantified accuracy, but also on the overall consistency of phase identification in multiphase regions. The cubic SVM demonstrated the most promising performance in capturing surfactant–oil–water phase behavior.
The results show that the cubic SVM was able to successfully reproduce key features of the phase behavior, such as the shrinking of two-phase regions as salinity deviates from optimal conditions, and the model performed well in predicting compositional spaces in near extrapolation.
Additionally, this study highlights the robustness and power of the new graphical equation-of-state model, which leverages the compositional space generated by the SVM model or any physical model. The graphical method consistently aligned with both the physical model and experimental data, reinforcing its value as a flexible framework for understanding multiphase behavior in surfactant systems.
This hybrid approach, combining ML-generated compositional spaces with a graphical equation-of-state framework, provides a promising avenue for extending the scope of phase behavior modeling while honoring complexity.
While machine learning in this context was not intended to replace traditional physical models, it demonstrated the capability to capture and generalize these non-linear transitions, even in near-extrapolated scenarios.

In conclusion, this work demonstrates that even for highly non-linear systems like surfactant–oil–water mixtures, machine learning combined with the novel graphical equation-of-state model is capable of reproducing and generalizing complex phase behavior. While the study did not seek to establish ML as a superior tool over traditional models, it shows that ML can be effectively integrated into phase behavior studies, offering a complementary and powerful way to investigate intricate multiphase phenomena in challenging compositional spaces. Future research may focus on exploring a wider range of machine learning algorithms and their applicability to a chosen system of interest. Moreover, future studies can incorporate experimental data and associated error bars to investigate machine learning applications with sparser data and associated uncertainties.

Author Contributions

Conceptualization, D.M. and M.S.; Methodology, D.M. and M.M.; Formal analysis, D.M.; Investigation, D.M.; Resources, M.M. and Z.D.; Writing—original draft, D.M. and M.M.; Supervision, Z.D. and M.S.; Project administration, M.M., Z.D. and M.S.; Funding acquisition, Z.D. and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research has been funded by the Science Committee of the Ministry of Science and Higher Education of the Republic of Kazakhstan (Grant No. AP13068661).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

Brantson, E.T.; Binshan, J.; Appau, P.O.; Akwensi, P.H.; Peprah, G.A.; Liu, N.; Aphu, E.S.; Boah, E.A.; Borsah, A.A. Development of hybrid low salinity water polymer flooding numerical reservoir simulator and smart proxy model for chemical enhanced oil recovery (CEOR). J. Pet. Sci. Eng. 2020, 187, 106751.
Pope, G.A.; Nelson, R.C. A chemical flooding compositional simulator. Soc. Pet. Eng. J. 1978, 18, 339–354.
Riahinezhad, M.; Romero-Zerón, L.; McManus, N.; Penlidis, A. Evaluating the performance of tailor-made water-soluble copolymers for enhanced oil recovery polymer flooding applications. Fuel 2017, 203, 269–278.
Stoll, W.M.; AL Shureqi, H.; Finol, J.; Al-Harthy, S.A.A.; Oyemade, S.; De Kruijf, A.; Van Wunnik, J.; Arkesteijin, F.; Bouwmeester, R.; Faber, M.J. Alkaline/surfactant/polymer flood: From the laboratory to the field. SPE Reserv. Eval. Eng. 2011, 14, 702–712.
Bourrel, M.; Schechter, R.S. Microemulsions and Related Systems: Formulation, Solvency, and Physical Properties; Editions Technip: Paris, France, 2010.
Camilleri, D.; Fil, A.; Pope, G.A.; Rouse, B.A.; Sepehrnoori, K. Comparison of an improved compositional micellar/polymer simulator with laboratory core floods. SPE Reserv. Eng. 1987, 2, 441–451.
Lake, L.; Johns, R.T.; Rossen, W.R.; Pope, G.A. Fundamentals of Enhanced Oil Recovery; SPE: Richardson, TX, USA, 2014; Volume 1.
Rosen, M.J.; Kunjappu, J.T. Surfactants and Interfacial Phenomena; John Wiley & Sons: Hoboken, NJ, USA, 2012.
Bhosle, M.R.; Joshi, S.A.; Bondle, G.M. An efficient contemporary multicomponent synthesis for the facile access to coumarin-fused new thiazolyl chromeno[4,3-b]quinolones in aqueous micellar medium. J. Heterocycl. Chem. 2020, 57, 456–468.
Naseri, N.; Ajorlou, E.; Asghari, F.; Pilehvar-Soltanahmadi, Y. An update on nanoparticle-based contrast agents in medical imaging. Artif. Cells Nanomed. Biotechnol. 2017, 46, 1111–1121.
Griffin, W.C. Classification of Surface-Active Agents by “HLB”. J. Cosmet. Sci. 1949, 1, 311–326.
Winsor, P.A. Hydrotropy, solubilisation and related emulsification processes. Trans. Faraday Soc. 1948, 44, 376–398.
Winsor, P.A. Solvent Properties of Amphiphilic Compounds; Butterworths Scientific Publications: London, UK, 1954.
Healy, R.N.; Reed, R.L.; Stenmark, D.G. Multiphase microemulsion systems. Soc. Pet. Eng. J. 1976, 16, 147–160.
Naceur, I.B.; Guettari, M.; Kassab, G.; Tajouri, T. Simple-complex fluid transition in microemulsions. J. Macromol. Sci. Part B 2012, 51, 2171–2182.
Skauge, A.; Fotland, P. Effect of pressure and temperature on the phase behavior of microemulsions. SPE Reserv. Eng. 1990, 5, 601–608.
Bouton, F.; Durand, M.; Nardello-Rataj, V.; Serry, M.; Aubry, J.-M. Classification of terpene oils using the fish diagrams and the Equivalent Alkane Carbon (EACN) scale. Colloids Surf. A Physicochem. Eng. Asp. 2009, 338, 142–147.
Wang, S.; Chen, C.; Yuan, N.; Ma, Y.; Ogbonnaya, O.I.; Shiau, B.; Harwell, J.H. Design of extended surfactant-only EOR formulations for an ultrahigh salinity oil field by using hydrophilic lipophilic deviation (HLD) approach: From laboratory screening to simulation. Fuel 2019, 254, 115698.
Hand, D.B. Dineric Distribution. J. Phys. Chem. 1930, 34, 1961–2000.
Salager, J.; Morgan, J.; Schechter, R.; Wade, W.; Vasquez, E. Optimum Formulation of Surfactant/Water/Oil Systems for Minimum Interfacial Tension or Phase Behavior. Soc. Pet. Eng. J. 1979, 19, 107–115.
Salager, J.L.; Marquez, N.; Graciaa, A.; Lachaise, J. Partitioning of Ethoxylated Octylphenol Surfactants in Microemulsion−Oil−Water Systems: Influence of Temperature and Relation between Partitioning Coefficient and Physicochemical Formulation. Langmuir 2000, 16, 5534–5539.
Salager, J.L.; Antón, R.E.; Briceño, M.I.; Choplin, L.; Màrquez, L.; Pizzino, A.; Rodriguez, M.P. The emergence of formulation engineering in emulsion making—Transferring know-how from research laboratory to plant. Polym. Int. 2003, 52, 471–478.
Acosta, E.; Szekeres, E.; Sabatini, D.A.; Harwell, J.H. Net-Average Curvature Model for Solubilization and Supersolubilization in Surfactant Microemulsions. Langmuir 2003, 19, 186–195.
Ghosh, S.; Johns, R.T. An Equation-of-State Model To Predict Surfactant/Oil/Brine-Phase Behavior. SPE J. 2016, 21, 1106–1125.
Khorsandi, S.; Johns, R.T. Robust Flash Calculation Algorithm for Microemulsion Phase Behavior. J. Surfactants Deterg. 2016, 19, 1273–1287.
Magzymov, D.; Johns, R.T. Inclusion of variable characteristic length in microemulsion flash calculations. Comput. Geosci. 2022, 26, 995–1010.
Gaganis, V.; Varotsis, N. Non-iterative phase stability calculations for process simulation using discriminating functions. Fluid Phase Equilibria 2012, 314, 69–77.
Li, Y.; Zhang, T.; Sun, S. Acceleration of the NVT flash calculation for multicomponent mixtures using deep neural network models. Ind. Eng. Chem. Res. 2019, 58, 12312–12322.
Zhang, T.; Li, Y.; Li, Y.; Sun, S.; Gao, X. A self-adaptive deep learning algorithm for accelerating multi-component flash calculation. Comput. Methods Appl. Mech. Eng. 2020, 369, 113207.
Ihunde, T.A.; Olorode, O. Application of physics informed neural networks to compositional modeling. J. Pet. Sci. Eng. 2022, 211, 110175.
Kashinath, A.; Szulczewski, M.; Dogru, A. A fast algorithm for calculating isothermal phase behavior using machine learning. Fluid Phase Equilibria 2018, 465, 73–82.
Peacock, C.J.; Lamont, C.; Sheen, D.A.; Shen, V.K.; Kreplak, L.; Frampton, J.P. Predicting the Mixing Behavior of Aqueous Solutions Using a Machine Learning Framework. ACS Appl. Mater. Interfaces 2021, 13, 11449–11460.
Tung, C.H.; Chang, S.Y.; Chang, M.C.; Carrillo, J.M.; Sumpter, B.G.; Do, C.; Chen, W.R. Inferring Colloidal Interaction from Scattering by Machine Learning. Carbon Trends 2023, 10, 100252.
Magzymov, D.; Makhatova, M.; Dairov, Z.; Syzdykov, M. Evaluation of Machine Learning Applications for the Complex Near-Critical Phase Behavior Modelling of CO₂—Hydrocarbon Systems. Appl. Sci. 2024, 14, 11140.
Furth, N.R.; Imel, A.E.; Zawodzinski, T.A. Comparison of Machine Learning Approaches for Prediction of the Equivalent Alkane Carbon Number for Microemulsions Based on Molecular Properties. J. Phys. Chem. A 2024, 128, 6763–6773.
Bell, G. Non-Ionic Surfactant Phase Diagram Prediction by Recursive Partitioning. Philos. Trans. R. Soc. A 2016, 374, 20150137.
Thacker, J.C.; Bray, D.J.; Warren, P.B.; Anderson, R.L. Can machine learning predict the phase behavior of surfactants? J. Phys. Chem. B 2023, 127, 3711–3727.
Seddon, D.; Müller, E.A.; Cabral, J.T. Machine Learning Hybrid Approach for the Prediction of Surface Tension Profiles of Hydrocarbon Surfactants in Aqueous Solution. J. Colloid Interface Sci. 2022, 625, 328–339.
Rashidi-Khaniabadi, A.; Rashidi-Khaniabadi, E.; Amiri-Ramsheh, B.; Mohammadi, M.R.; Hemmati-Sarapardeh, A. Modeling interfacial tension of surfactant–hydrocarbon systems using robust tree-based machine learning algorithms. Sci. Rep. 2023, 13, 10836.
Talapatra, A.; Nojabaei, B.; Khodaparast, P. A Data-Based Continuous and Predictive Viscosity Model for the Oil-Surfactant-Brine Microemulsion Phase. In Proceedings of the SPE Improved Oil Recovery Conference, Tulsa, OK, USA, 23–25 April 2024.
Zhang, G.; Yu, J.; Du, C.; Lee, R. Formulation of surfactants for very low/high salinity surfactant flooding without alkali. In Proceedings of the SPE International Symposium on Oilfield Chemistry, The Woodlands, TX, USA, 13–15 April 2015.

Figure 1. Illustration of input parameter range and model output types for machine learning algorithms.

Figure 2. Illustration of graphical equation-of-state model. Blue cross overall composition is in single-phase region, purple cross is in two-phase equilibrium of microemulsion A and oil, and green cross is in three-phase region where equilibrium is defined by B and C tie-lines.

Figure 3. Graphical equation-of-state model workflow.

Figure 4. Physical model performance compared to experimental data, salinity scan. Data from Zhang et al. (2015) [41].

Figure 5. Reference phase behavior based on physical model, to be used for machine learning comparison.

Figure 6. Machine learning phase behavior predictions (accuracy): fine tree (97.4%), medium tree (93.4%), linear SVM (90.6%), cubic SVM (99.5%), KNN (98.6%), boosted trees (96.9%).

Figure 7. Phase behavior predicted with machine learning and graphical equation-of-state, ternary diagram at varying salinities: (a) 0.85%wt, (b) 1.15%wt, (c) 1.5%wt.

Figure 8. Fixed water–oil ratio compositional space, fish plot. HLD = ln(S/Sopt), Sopt = 1.15%wt.

Figure 9. Machine learning model extrapolation: (a) near extrapolation at 3.5%wt (3%wt is the upper range for tuned data); (b) far extrapolation at 7%wt (3%wt is the upper range for tuned data).

Figure 10. Illustration of reduced accuracy when only sparse data are available for SVM training.

Figure 11. Graphical equation-of-state model performance compared to experimental data and physical model, salinity scan. Data from Zhang et al. (2015) [41].
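The HLD definition given in the Figure 8 caption, HLD = ln(S/Sopt) with Sopt = 1.15%wt, can be evaluated directly for the salinities shown in Figures 7 and 9. The snippet below is a small worked example; the function names and the `tol` threshold used to label a salinity as "optimum" are assumptions made for illustration.

```python
import math

S_OPT = 1.15  # optimum salinity in %wt, from the Figure 8 caption


def hld(salinity, s_opt=S_OPT):
    """Hydrophilic-lipophilic deviation, HLD = ln(S / Sopt)."""
    if salinity <= 0 or s_opt <= 0:
        raise ValueError("salinities must be positive")
    return math.log(salinity / s_opt)


def salinity_regime(salinity, tol=1e-9):
    """Label a salinity by the sign of its HLD (tol is an assumed threshold)."""
    value = hld(salinity)
    if abs(value) <= tol:
        return "optimum"
    return "under-optimum" if value < 0 else "over-optimum"


# Salinities (%wt) from Figure 7 (0.85, 1.15, 1.5) and Figure 9 (3.5, 7).
for s in (0.85, 1.15, 1.5, 3.5, 7.0):
    print(f"S = {s:>4} %wt  HLD = {hld(s):+.3f}  {salinity_regime(s)}")
```

Because HLD is logarithmic in salinity, the far-extrapolation case at 7%wt lies much further from the optimum on the HLD axis than the near-extrapolation case at 3.5%wt does from the 3%wt upper limit of the tuned data.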

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
