Article

Assessing the Role of Machine Learning in Climate Research Publications

by Andreea-Mihaela Niculae 1,2,*, Simona-Vasilica Oprea 1, Alin-Gabriel Văduva 1,2, Adela Bâra 1 and Anca-Ioana Andreescu 1
1 Department of Economic Informatics and Cybernetics, Bucharest University of Economic Studies, 010374 Bucharest, Romania
2 Doctoral School of Economic Informatics, Bucharest University of Economic Studies, 010374 Bucharest, Romania
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(24), 11086; https://doi.org/10.3390/su162411086
Submission received: 1 November 2024 / Revised: 14 December 2024 / Accepted: 16 December 2024 / Published: 18 December 2024
(This article belongs to the Special Issue Air Pollution Management and Environment Research)

Abstract

Climate change presents urgent challenges that require innovative approaches and collaborative efforts across diverse fields. Our research investigates the growth and thematic structure of the intersection between climate change research and machine learning (ML). Employing a mixed-methods approach, we analyzed 7521 open-access publications from the Web of Science Core Collection (2004–2024), leveraging both R and Python for data processing and advanced statistical analysis. The results reveal a striking 37.39% annual growth in publications, indicating the rapidly expanding and increasingly significant role of ML in climate research. This growth is accompanied by increased international collaborations, highlighting a global effort to address this urgent challenge. Our approach integrates bibliometrics, text mining (including word clouds, knowledge graphs with Node2Vec and K-Means, factorial analysis, thematic map, and topic modeling via Latent Dirichlet Allocation (LDA)), and visualization techniques to uncover key trends and themes. Thematic analysis using LDA revealed seven key topic areas, reflecting the multidisciplinary nature of this research field: hydrology, agriculture, biodiversity, forestry, oceanography, forecasts, and models. These findings contribute to an in-depth understanding of this rapidly evolving area and inform future research directions and resource allocation strategies by identifying both established and emerging research themes along with areas requiring further investigation.

1. Introduction

Climate change is one of the most pressing global challenges, affecting nearly every aspect of life. Its impact spans agriculture, ecosystems, health, and even businesses, emphasizing the urgent need for innovative solutions [1,2]. Defined as long-term changes in global weather patterns, climate change is primarily driven by rising temperatures [3,4], which have led to more frequent extreme weather events and significant biodiversity loss. Addressing these issues requires immediate and coordinated action by policymakers and communities to adopt sustainable practices.
The Intergovernmental Panel on Climate Change (IPCC) has highlighted these concerns in reports like the Special Report on Climate Change and Land and the Global Warming of 1.5 °C report. These call for ambitious measures to limit global warming to 1.5 °C above pre-industrial levels by 2100 [5]. In response, multiple countries have already started to adopt such policies. The United States has created a National Climate Task Force with diverse sustainable objectives [6]. China has followed suit, focusing on reducing greenhouse gas emissions [7]. The United Kingdom is committed to achieving net-zero emissions by 2050 [8], with other nations also taking action. However, these measures are just the beginning, and achieving these goals requires accurate analysis and informed decision making, both of which rely heavily on robust climate data.
Traditional methods of analyzing climate data often struggle to manage the complexity, scale, and heterogeneity of these datasets. Climate systems involve complex interconnections across temporal and spatial scales, requiring advanced tools capable of uncovering patterns and relationships. These challenges limit the ability to make precise predictions, assess risks, and devise effective strategies. Machine learning (ML) has emerged as a promising solution [9], offering tools capable of processing vast, diverse datasets, extracting meaningful insights, and enabling actionable predictions. Early developments in ML have demonstrated its utility in applications such as predictive modeling, risk assessment, and optimizing resource management for climate adaptation. Despite this progress, the integration of ML into climate research is still in its early stages, leaving significant potential unexplored [10]; thus, key gaps remain in climate ML research. There is limited understanding of ML’s historical growth in climate studies or the geographical leaders driving advancements. Collaboration dynamics and the benefits of combining bibliometric, text mining, and visualization methods are underexplored. Studies lack clarity on key topics, diverse ML approaches, and the specific domains applying ML. Addressing these gaps can enhance understanding, foster collaboration, and guide impactful applications.
Recognizing the urgency of climate action, this paper explores publications that integrate climate-related themes with ML techniques, aiming to strengthen our understanding of their interconnections and applications. By examining this intersection and some of the identified gaps, we aim to address several research questions (RQs):
RQ1: To what extent has the integration of ML in climate research grown over time?
RQ2: Which countries have contributed the most in this domain?
RQ3: What is the nature of collaboration in climate change and ML research?
RQ4: How do bibliometric indicators, text mining methods, and visualization techniques complement each other, and what hidden themes can be identified through their combination?
RQ5: What key topics and sub-themes can be identified within this research area?
RQ6: What diverse approaches and methodologies are explored using ML in climate research?
RQ7: Which domains connected to climate research utilize the most ML methods?
Each research question has been considered to align with the specific knowledge gaps identified in our introduction. For instance, RQ1 seeks to quantify the integration of machine learning (ML) in climate research over time. RQ2 and RQ3 explore geographical contributions and collaboration dynamics, which are important for fostering knowledge exchange and enhancing collective efforts in climate adaptation. The inclusion of RQ4 through RQ7 emphasizes the importance of interdisciplinary approaches, examining how bibliometric indicators, text mining, and visualization techniques complement each other. This addresses the gap regarding the underutilization of combined methods in climate ML research.
By addressing these questions, we highlight our contribution: integrating multiple data analysis techniques across two specialized programming languages, R and Python. These techniques include bibliometrics, text mining (including word clouds, knowledge graph with Node2Vec and K-Means, and topic modeling using Latent Dirichlet Allocation (LDA)), as well as various data visualization methods. Our aim is to provide a comprehensive analysis of the existing literature on climate change and ML. By combining these approaches, we offer an enhanced understanding of the field’s thematic landscape and its evolution, a perspective that would not be attainable through individual methods alone. This approach also serves as a replicable model for researchers to explore other interdisciplinary fields.
The research’s thematic insights are particularly noteworthy. Using LDA topic modeling, the researchers identified seven key areas central to the domain: (1) methodological approaches and energy systems, (2) predictive modeling and performance analysis, (3) forest mapping and land classification, (4) agricultural and soil observations, (5) oceanic and atmospheric studies, (6) hydrological and climate impacts, and (7) species distribution and environmental changes.
Another significant contribution lies in the research’s methodological framework. By employing a mixed-methods approach that integrates bibliometrics, text mining, and advanced statistical analysis, the research provides a comprehensive view of the thematic and collaborative landscape in this domain. Tools such as knowledge graphs generated using Node2Vec and K-Means and topic modeling with LDA uncover patterns, connections, and insights that might otherwise remain obscured. The methodology demonstrates the utility of combining diverse analytical tools to gain a deeper understanding of the field.
The research also leverages knowledge graphs to identify thematic clusters, including ecosystem analysis using ML, water management and predictive analytics, climate impact modeling and prediction, and climate change modeling and classification. These findings emphasize the diverse applications of ML in addressing various aspects of climate change, from localized issues to global-scale modeling efforts.
The breadth of applications revealed by this study is a relevant, critical contribution. By addressing the domains connected to ML and climate research, the findings underline the versatility of ML in tackling challenges ranging from hydrology and biodiversity to agriculture, oceanography, and predictive modeling. Our research paper is organized into several sections: an introduction, which underlines the significance of climate change, its urgency, our research questions, and contributions; a literature review that discusses the latest publications in the field of bibliometrics, climate, and ML research; a methods and data section detailing the data extraction process and the methodologies used for analysis; a results section showcasing the various techniques employed and the obtained key findings; and finally, a conclusion that summarizes the study’s contributions.

2. Literature Review

The study of climate change has grown rapidly since it became a formal research field in the 1980s, highlighted by numerous bibliometric analyses that quantify this growth and examine thematic trends across publications [11,12], such as climate modeling, oceanic studies, and atmospheric sciences, predominantly within the Natural Sciences [12]. The number of climate-related studies has been doubling every four years, showing the fast pace of the field’s expansion [13,14].
Despite this rich body of work, there are still gaps in how artificial intelligence (AI) is used in climate research. ML, a subset of AI, has been widely applied to various climate aspects, addressing both short-term and long-term predictions. However, other useful AI tools, such as expert systems, decision trees, or optimization algorithms, are not well integrated into climate studies [15]. AI has been applied in areas like energy, transport, and city planning [15,16], but there is still room to explore its full potential and improve its integration into these fields. Advancements in AI technologies create opportunities to develop and improve climate models, highlighting the need for closer collaboration between AI experts and climate scientists [16,17,18,19].
Additionally, studies have shown that ML models are effectively applied in both short-term predictions, covering areas such as agriculture, energy, and disaster management, and medium- to long-term predictions in fields such as urban planning, insurance, and public health [19]. Short-term climate predictions use algorithms like neural networks and memory models, while medium- and long-term predictions rely on methods such as K-Nearest Neighbors and Random Fields [19]. While ML models effectively address short- and long-term climate predictions, there is limited research on how to integrate insights from these timeframes to assess cumulative impacts. For instance, understanding how short-term changes in agriculture could cascade into long-term effects on urban planning or public health remains underexplored.
While the previous findings focus on modeling and predicting climate-related aspects using ML, there are other ML models that assess the perception and implications of climate events using natural language processing (NLP) methods to analyze bodies of text and discover underlying themes. One notable NLP technique is the LDA algorithm, which can be applied to bag-of-words representations to assess public perceptions regarding climate events [20,21] and media coverage [22,23], uncovering a mix of views and key topics. LDA can also facilitate bibliometric methods, such as analyzing abstracts [24,25,26,27,28], to reveal common ideas and themes in climate-related studies and publications.
For instance, research employing LDA on Twitter datasets identified eight topics categorized by sentiments: news (media, accurate information), support, neutral, and anti, revealing a growing public awareness of global warming as a human-made issue and a desire for more substantial government action to address climate change [20]. Similarly, an analysis of news media in Pakistan using LDA identified several emerging themes, including climate governance and policy, climate change impacts, societal contributions to climate efforts, climate politics, climate science, and climate solutions [22]. Lastly, a study utilizing LDA on abstracts extracted via the Scopus API identified 14 key topics related to climate change and sustainability, such as sustainable food, green consumers, and sustainable consumption. This research indicated that existing marketing principles related to climate change are inadequately articulated, although they are more effectively communicated through the concept of sustainability. This analysis also noted a peak in citations around 2016, suggesting a possible decline in the readers’ interest in the subject since then [24].
The link between climate change and social or political factors is another area of interest. Research highlights how significant milestones such as the IPCC AR4 climate change report and the Sustainable Development Goals (SDGs) have shaped research priorities, but how these shifts happened has not been fully studied [29,30].
This study contributes by using bibliometric analysis to track and explain research shifts influenced by significant socio-political events like the SDGs. It offers a novel integration of AI techniques that address multi-sectoral challenges in climate science, ultimately filling the identified gaps in existing methodologies and providing insights for researchers and policymakers alike.

3. Methods and Data

Our study utilizes data gathered from the Web of Science Core Collection database. We selected this database for its comprehensive and high-quality coverage of scholarly literature [31], as it provides coverage of more than 35,000 journals.
We downloaded the desired bibliometric data for our analysis using the query constructed as specified in Equation (1), essential for reducing the results to only the most relevant ones. The query was performed manually via the Web of Science interface, without employing APIs or web-crawling scripts. This decision was made to leverage the platform’s user-friendly search functionality, simplifying access for general users.
The abbreviations are AF for All Fields, DT for Document Types, OA for Open Access, LA for Language, RN for Retraction Notices, and PY for Publication Year. The primary focus of the query involves combining the term “climate” with any other word alongside “machine learning”, across all fields. The timeframe is set to all years excluding 2025, as documents from that year have not yet been published. For our study we specifically selected only open-access documents to ensure easy access to their content when needed.
To validate our search terms, we conducted a preliminary review of the literature to ensure that the keywords used reflected current trends and language commonly found in climate and machine learning research. This process involved reviewing recent publications and documents that employed similar search criteria [11,12,13] to establish the relevance of our terms. While OA articles represent a subset of all publications, studies suggest that they frequently cover high-impact research and are representative of key trends in the field. Additionally, the growing shift toward OA publication mitigates concerns about coverage limitations [32]. However, we acknowledge the potential limitations of this approach.
$$
\mathrm{AF} = \text{climate *} \land \mathrm{AF} = \text{machine learning} \lor \mathrm{AF} = \text{ML} \land \mathrm{DT} = \text{Article} \land \mathrm{OA} = \text{True} \land \mathrm{LA} = \text{English} \land \lnot\, \mathrm{RN} = \text{Retracted Publication} \land \lnot\, \mathrm{PY} = 2025 \tag{1}
$$
where * means any word following climate; ∧ means AND; ∨ means OR; ¬ means NOT.
By utilizing the query in the Web of Science portal, a total of 7519 documents were extracted as of 10 October 2024, representing journal articles and conference papers containing bibliometric metadata such as titles, abstracts, publication years, and citation counts. In this study, documents refers to metadata records, not full text. Full-text analysis was excluded, as the focus was on bibliometric patterns rather than content analysis. OA documents were prioritized for accessibility and reproducibility. Although OA articles represent a subset of all publications, they frequently include high-impact research, ensuring the sample reflects broader trends in the field.
The bibliometric information was downloaded and saved either as CSV files for processing in Python, or as BibTeX files for analysis in R, specifically using the Bibliometrix package. Figure 1 showcases the complete process for extracting the required data. A key step is verifying that all requirements are met, achieved through a systematic review of the query results. This involved (1) reviewing research objectives, (2) creating a checklist of expected outcomes (e.g., publication type, date range, keyword relevance), (3) assessing retrieved articles against these criteria, and (4) refining the query iteratively until all requirements were satisfied. For instance, we confirmed that all retrieved articles were journal publications within the specified date range, included relevant keywords, and contained complete bibliometric metadata. This validation ensured the data were ready for subsequent processing.
Before analyzing the datasets, we introduced an additional step to ensure data cleanliness [33], enhancing the reliability of the following analyses. This step comprised several actions. First, we removed duplicates to ensure that each document was represented only once in our dataset, as the human error from manual download can create multiple entries. Then, we standardized the formats of key fields such as author names, publication years, and titles, thus fixing inconsistencies. The data were assessed for missing values and, where appropriate, a decision had to be made whether to fill these gaps with relevant information or remove incomplete records that could compromise the analysis. Redundancies, data irrelevant to our research questions, were eliminated by discarding columns, focusing on essential bibliometric data. Finally, validating the data is important, and this was achieved by cross-checking a sample of the entries against the original source in the Web of Science database to verify the accuracy of the bibliometric information.
For this paper, we utilized multiple analysis methods for data exploration, visualization, and mining. The initial step involves exploratory data analysis complemented by bibliometric methods. We employed visualizations, including knowledge graphs, factorial analysis, and thematic maps, to present insights. To further understand the complexity of the extracted data, we implemented data mining techniques such as Node2Vec with the K-Means clustering and LDA.
We used two different programming languages specializing in data analytics, R (version 4.4.1) and Python (version 3.12). The integrated use of R and Python presents several key advantages. Python is flexible and powerful at handling data, which is essential for data cleaning and preparation. R’s comprehensive bibliometrics package, Bibliometrix, offers specialized functions for in-depth analysis, visualization, and reporting. Bibliometrix automatically detects countries from authors’ affiliations, which was needed to plot maps using Python. This combined approach maximized both the efficiency and analytical power of the workflow, aiding in obtaining more complete insights.
This combination of R and Python sets the stage for our analysis, allowing for a seamless integration of various data processing and visualization methods: exploratory data analysis, exploratory factorial analysis, bibliometrics, knowledge graph with Node2Vec, K-Means, and Latent Dirichlet Allocation.

3.1. Exploratory Data Analysis (EDA)

The initial step in analyzing any dataset [34] is to discover its characteristics, to develop a general understanding of two key aspects: the data that will be used for future analysis and the type of cleaning required to prepare it for effective use in those analyses. EDA is an essential component of data pre-processing as it presents the data in an easily understandable format.
EDA is a key step in research analysis [35]. EDA is the methodology of studying the dataset to uncover characteristics and possible areas that need to be cleaned [36], an essential step before performing analyses. This methodology can be seen in Figure 2, and it mainly contains the techniques needed to obtain descriptive statistics, basic plots (such as histograms, scatter plots, boxplots), and basic tables (such as frequency, contingency), all of which are needed to understand underlying aspects of the dataset, by exploring the dataset and identifying patterns, relationships, and deviations. Unlike data mining, EDA focuses on the meaning and interpretation of data and can be applied to all variables. Beyond basic analysis, EDA can also visualize high-dimensional data [35] through data reduction methods like Principal Component Analysis and factor analysis. EDA helps with decision making for data cleaning and pre-processing by revealing inconsistencies, missing values, and outliers. Visualizing patterns and distributions assists in identifying necessary data processing steps and determining the best strategies to ensure data integrity and quality for further analyses.

3.2. Exploratory Factorial Analysis (EFA)

When dealing with datasets consisting of multiple variables, reducing dimensionality can help uncover underlying relationships and patterns, facilitating data interpretation [38]. One method for this is exploratory factorial analysis (EFA), a component of EDA. EFA identifies common and unique factors influencing variables, grouping them based on shared variance. Mathematically, EFA computes eigenvalues and eigenvectors to capture the variance explained by each factor. Equation (2) [39] presents the exploratory factor analysis model, where $x$ is a vector for the observed responses (actual values), $\mu$ is the vector of means, $f$ is a vector of common factors, $u$ is a vector of unique factors, and $\Lambda$ is the matrix of factor loadings.
$$
x = \mu + \Lambda f + u \tag{2}
$$
These calculations are based on minimizing common variance, with each factor accounting for a portion of the total variance. Eigenvalues and eigenvectors are determined for each variable to examine how they can be expressed as linear combinations of common factors. The objective is to remove as much common variance as possible, with each factor explaining a certain percentage of the variable’s variance. Ultimately, only the common factors are retained, while the remaining values help with understanding the intensity of each factor within every variable, which is useful in naming these factors.
While there is no strict threshold for the percentage of variance that must be explained by EFA, a higher percentage indicates a stronger and more reliable factor structure. Generally, explaining a higher portion of variance is desirable [40], as it suggests that the factors adequately capture the underlying relationship in the data. Additionally, for the factor analysis to be considered robust, the Measure of Sampling Adequacy, a component of the Kaiser–Meyer–Olkin test, should exceed 50% for each variable.
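As a short worked consequence of Equation (2), and writing $\Lambda$ for the matrix of factor loadings, the sketch below shows how the variance of each observed variable splits into a common and a unique part. It relies on the usual textbook assumptions, which are not stated above: the common factors $f$ are uncorrelated with unit variance, the unique factors $u$ are uncorrelated with $f$ and with each other, and $\Psi$ (a symbol introduced here for illustration) denotes their diagonal covariance matrix with entries $\psi_i$. The number of variables $p$ and the number of factors $m$ are likewise illustrative symbols.

```latex
\begin{align}
\operatorname{Cov}(x) &= \Lambda \Lambda^{\top} + \Psi, \\
\operatorname{Var}(x_i) &= \underbrace{\sum_{j=1}^{m} \lambda_{ij}^{2}}_{h_i^{2}\ \text{(communality)}} + \underbrace{\psi_i}_{\text{uniqueness}}, \\
\text{share of total variance explained by factor } j &= \frac{1}{p} \sum_{i=1}^{p} \lambda_{ij}^{2}
\quad \text{(for standardized variables)}.
\end{align}
```

Under these assumptions, a factor with large squared loadings across many variables accounts for a large share of the total variance, which is the quantity referred to when discussing the percentage of variance explained.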

3.3. Bibliometrics

Bibliometrics, or the quantitative study of books, media, scientific disciplines, and scholarly communication, is one of the three metrics (bibliometrics, scientometrics, and informetrics) [41] used to gain insights into research fields and publications. To conduct our bibliometric analysis, illustrated in Figure 3, we employed Biblioshiny [42] in R for visualization, Pandas in Python for metadata processing, and LDA-specialized packages for the text mining used in content analysis. We extracted the metadata from scientific databases through manual downloads, which included information about authors, affiliations, countries, keywords, abstracts, citations, and citation counts.
Key areas of focus in the bibliometric analysis included research trends, geographical insights, collaboration networks, and publication metrics. For keyword visualization, we used factorial keyword maps, word clouds, and knowledge graphs, while text mining (e.g., bigram word clouds and LDA) was applied to analyze abstracts in greater depth. These methods provided a comprehensive view of the research landscape and dynamics within the field [41]. As a dedicated research field [43], bibliometrics provides useful information across multiple publications, increasing the understanding of various research subjects and dynamics.

3.4. Knowledge Graph and Node2Vec

A powerful tool for relationship visualization [44], a knowledge graph offers insight into underlying themes and links within a dataset. Often referred to as a semantic network, it illustrates real-world entities and the connections between them, including their intensity. Visually depicted with nodes and edges, a knowledge graph can be constructed using ML techniques, with or without natural language processing (NLP), to create a comprehensive representation of nodes, edges, and labels. Observing a knowledge graph can reveal hidden themes and connections within data, making it useful in applications like recommendation systems, data integration, and enhancing information retrieval.
One such NLP ML technique for creating a knowledge graph is Node2Vec [45], an unsupervised learning method. It uses node embedding principles to create the vector representation for graphs and relational structures. This algorithm deploys random walks through a graph, sampling the neighborhood of nodes and generating embeddings to preserve the network structures. In the end, each node is considered a token (a unique word in a dictionary), with its unique vector representation capturing the nuances of its connections and relationships within the graph. Node2Vec uses tokenization principles. Mathematically, the Node2Vec essence is captured in Equation (3), where the objective is to maximize the logarithmic probability of observing a network neighborhood $N_S(u)$ for node $u$, based on its representation in the embedded space, $f_u$. $\Pr(N_S(u) \mid f_u)$ is the probability of observing neighborhood nodes.
$$
\max_{f} \sum_{u \in V} \log \Pr\left(N_S(u) \mid f_u\right) \tag{3}
$$
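The neighborhood sampling step described above can be illustrated with a minimal sketch of a Node2Vec-style biased random walk. It is written in plain Python and is not the implementation used in this study. The return parameter `p` and in-out parameter `q` follow the naming of the Node2Vec paper [45]; the function names and the toy keyword graph are hypothetical.

```python
import random


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Sample one second-order random walk of at most `length` nodes.

    graph: dict mapping each node to the set of its neighbours
           (assumed undirected and unweighted for this sketch).
    p:     return parameter (a small p makes stepping back more likely).
    q:     in-out parameter (a small q favours moving away from the previous node).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(graph[current])
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step: no previous node yet, so choose uniformly.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)   # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)       # stay close to the previous node
            else:
                weights.append(1.0 / q)   # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def sample_walks(graph, walks_per_node, length, p=1.0, q=1.0, seed=0):
    """Start `walks_per_node` walks from every node of the graph."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = sorted(graph)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(biased_walk(graph, node, length, p, q, rng))
    return walks


# Hypothetical keyword co-occurrence graph, for illustration only.
toy_graph = {
    "climate": {"model", "drought", "forest"},
    "model": {"climate", "forecast"},
    "forecast": {"model"},
    "drought": {"climate", "forest"},
    "forest": {"climate", "drought"},
}
walks = sample_walks(toy_graph, walks_per_node=2, length=5, p=1.0, q=0.5)
```

In a full pipeline, each walk would be treated like a sentence and passed to a skip-gram model to learn the node vectors $f_u$; that training step is omitted here.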

3.5. K-Means

Grouping data is essential for understanding datasets, especially when dealing with high-volume data. Being one of the most powerful and widely used data mining algorithms [46], K-Means is a clustering algorithm that utilizes the concepts of distance and centroids to form a given number $K$ of clusters, iteratively refining them to minimize the distance between data points and their respective centroids. Grouping data together aids in exploring theme-related aspects and understanding how data elements interact to reveal hidden patterns. The effective implementation of K-Means requires careful consideration of the appropriate number of clusters ($K$) to ensure meaningful results. Starting from a dataset $X = \{x_i\}$, where $i = \overline{1, n}$, the data are then partitioned into $K$ clusters $C = \{C_k\}$, $k = \overline{1, K}$. The core concept of K-Means is the calculation of the centroids $\mu_k$, which represent the mean (center) of the data points in each cluster. The objective of this algorithm is to identify the $K$ clusters by minimizing the cost function from Equation (4).
$$
\sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^{2} \tag{4}
$$
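To make the cost function in Equation (4) concrete, the following is a minimal sketch of the standard iterative procedure (Lloyd's algorithm) in plain Python. It is an illustration under simple assumptions (random initial centroids drawn from the data, Euclidean distance), not the code used in this study; the function names and the toy points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Return (labels, centroids, cost) after Lloyd's iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(point, centroids[j]))
            for point in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = mean_point(members)
    # Cost function of Equation (4): within-cluster sum of squared distances.
    cost = sum(squared_distance(pt, centroids[lab]) for pt, lab in zip(points, labels))
    return labels, centroids, cost


# Two well-separated toy groups (hypothetical data).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, cost = k_means(toy_points, k=2)
```

In the knowledge-graph setting described earlier, the points would be the node embeddings rather than two-dimensional coordinates.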

3.6. Latent Dirichlet Allocation

As mentioned before, bibliometric techniques include text mining algorithms. One of the most powerful techniques for text mining is topic modeling, which reveals hidden patterns and relationships in large datasets. One of the most popular methods in the topic modeling domain is Latent Dirichlet Allocation (LDA), introduced in 2003 [47]. Subsequent work, such as [48], expanded on its methodology, particularly in inference techniques. LDA is an unsupervised generative probabilistic model of a corpus [49]. Given a bag of words (corpus), LDA identifies topics in the text data by constructing a probabilistic model that clusters words. It associates each word with a probability of belonging to a certain topic, assuming that documents comprise a multitude of topics and topics are amalgams of words [50].
An important parameter when deploying LDA models is the selection of the number of topics. While coherence scores are often employed as one metric for evaluating topic quality, it is essential to note that topic model evaluation is an ongoing area of research with no universally accepted measures for comparison between models [51]. Coherence scores assess the degree of semantic similarity between words within a topic, but they are primarily useful for evaluating individual topics rather than making direct comparisons across different models [52]. In practice, coherence is calculated by comparing the top N words of a topic and analyzing their co-occurrences in the corpus, with higher coherence scores generally indicating better topic quality. The coherence score presented in this paper is the c_v score [52], which measures how well the most frequent words within a topic are related based on their co-occurrence in the dataset. This score provides a reliable balance between interpretability and the ability to identify meaningful patterns in the data. The c_v score calculations combine statistical measures, such as normalized pointwise mutual information (NPMI), to produce a coherence score that reflects the strength of connections between the top words.
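The role of NPMI in coherence scoring can be illustrated with a simplified sketch. The code below averages pairwise NPMI over the top words of a topic using document-level co-occurrence. This is deliberately simpler than the full c_v measure and is not the computation used in this study; the function names and the toy documents are hypothetical.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """Normalized pointwise mutual information of two words.

    documents: list of sets of words; probabilities are estimated as the
    fraction of documents containing the word(s).
    """
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0.0:
        return -1.0  # the words never co-occur
    if p_ab == 1.0:
        return 1.0   # the words co-occur in every document
    pmi = math.log(p_ab / (p_a * p_b))
    return pmi / -math.log(p_ab)


def topic_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)


# Hypothetical mini-corpus: each document is reduced to its set of words.
toy_documents = [
    {"rain", "flood", "river"},
    {"rain", "flood"},
    {"forest", "fire"},
    {"forest", "fire", "rain"},
]
coherent = topic_coherence(["rain", "flood"], toy_documents)
incoherent = topic_coherence(["flood", "fire"], toy_documents)
```

Words that tend to appear in the same documents receive a score close to 1, while words that never appear together receive -1, so a topic whose top words co-occur often obtains a higher average.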
The LDA method is useful in analyzing large-scale text data, uncovering latent patterns, and improving information retrieval processes. In order to assign topics to words, LDA uses an algorithm known as the Gibbs sampling [48], whose formula is in Equation (5). The first ratio is the probability of topic t to be in document d . The second radio is the probability that word w belongs to topic t . The algorithm calculates the probabilities in an iterative process: for the first ratio, it computes the number of words in document d that belong to topic t ; for the second ratio, it computes the portion of occurrences of w in t .
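$$P(Z_i = t \mid Z_{-i}, w) = \frac{n_{m,t}^{-i} + \alpha}{\sum_{t=1}^{T} \left( n_{m,t}^{-i} + \alpha \right)} \times \frac{n_{t,w_i}^{-i} + \beta}{\sum_{v=1}^{V} \left( n_{t,v}^{-i} + \beta \right)} \qquad (5)$$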
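As an illustration of how the two ratios in Equation (5) combine, the sketch below computes the topic distribution for a single word from count tables and draws a new topic from it. The data layout, function names, and toy counts are assumptions made for this example. The product of the two ratios is additionally normalized over topics so that it can be sampled from.

```python
import random


def topic_conditional(n_doc_topic, n_topic_word, word, alpha, beta):
    """Topic probabilities for one word, following the two ratios of Equation (5).

    n_doc_topic[t]     : words of the current document assigned to topic t
    n_topic_word[t][v] : times vocabulary item v is assigned to topic t
    Both tables are assumed to already exclude the word being resampled.
    """
    doc_denominator = sum(count + alpha for count in n_doc_topic)
    weights = []
    for t, doc_count in enumerate(n_doc_topic):
        doc_ratio = (doc_count + alpha) / doc_denominator
        word_denominator = sum(count + beta for count in n_topic_word[t])
        word_ratio = (n_topic_word[t][word] + beta) / word_denominator
        weights.append(doc_ratio * word_ratio)
    total = sum(weights)
    return [weight / total for weight in weights]  # normalize over topics


def resample_topic(n_doc_topic, n_topic_word, word, alpha, beta, rng=random):
    """Draw a new topic index for the word from its conditional distribution."""
    probabilities = topic_conditional(n_doc_topic, n_topic_word, word, alpha, beta)
    return rng.choices(range(len(probabilities)), weights=probabilities, k=1)[0]
```

A full sampler would repeat this step for every word in every document, decrementing the counts before each draw and incrementing them afterwards.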

4. Results

4.1. Exploratory Data Analysis

After obtaining the bibliometric dataset of articles related to climate and ML, we started the data cleaning process and observed a large number of variables. Descriptive statistics showed that, while most variables contained complete observations, some had missing information. Specifically, the Highly Cited Status and Hot Paper Status variables had no entries; none of the papers returned by the query were marked as either highly cited or hot papers. This finding suggests that research in this domain may not be as prominent as in others, and it could indicate that ML has not yet become a significant factor in climate research. The media has, in multiple cases, discarded the social, economic, technological, and local aspects of climate change research [53] by electing to publish only certain results, which can strongly affect the quality of research performed in this field. Since we were unable to infer information about these two variables, they were eliminated from the research. Missing years were filled with the mean publication year, missing citation counts with 0 (suggesting no data found), and missing texts with “Missing Data”.
To ensure that each document was represented only once, we identified and removed duplicates based on key fields such as title and DOI, minimizing human error from the manual download process. Variables with more than 50% missing information were also discarded, thus removing redundancies in our data. We standardized the formats of key text fields, ensuring consistent capitalization. Finally, we manually validated data by randomly selecting entries and performing look-ups to verify the accuracy of the bibliometric information. In the end, no document was removed. We have created a GitHub repository to share the data and code used in this study, available at https://github.com/andreeaniculaecsie/sust2024-climate-ml (accessed on 28 November 2024).
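The cleaning steps described above can be sketched as follows. This is a minimal illustration over a list of dictionaries; the field names (`doi`, `title`, `year`, `citations`, `abstract`) are hypothetical and do not come from the study's repository.

```python
from statistics import mean


def clean_records(records):
    """Fill missing values and drop duplicates, keeping the first occurrence."""
    known_years = [r["year"] for r in records if r.get("year") is not None]
    fill_year = round(mean(known_years))  # mean publication year
    seen_keys = set()
    cleaned = []
    for record in records:
        row = dict(record)  # work on a copy so the input stays untouched
        if row.get("year") is None:
            row["year"] = fill_year
        if row.get("citations") is None:
            row["citations"] = 0
        for field in ("title", "abstract"):
            if not row.get(field):
                row[field] = "Missing Data"
        # Deduplicate on DOI when present, otherwise on the normalized title.
        key = (row.get("doi") or "").strip().lower() or row["title"].strip().lower()
        if key in seen_keys:
            continue
        seen_keys.add(key)
        cleaned.append(row)
    return cleaned
```

Matching on a lowercased DOI or title is one possible reading of "key fields such as title and DOI"; stricter or fuzzier matching rules would change which records count as duplicates.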
Table 1 provides general information regarding the dataset, obtained using the Bibliometrix package in R [42]. Although the timeframe was set to “all years”, the table reveals that the first article combining climate research and ML was published in 2004, which is relatively recent. In the same year, the World Business Council for Sustainable Development published its goals for 2030 for a more sustainable life, in which it categorized and highlighted climate change as a real risk for society [54]. We extracted a total of 7521 articles from over 1300 sources. The annual growth rate is 37.39%, reflecting a notable increase in publications within this field and suggesting a growing interest in the intersection of climate research and ML.
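For readers who wish to reproduce a growth figure of this kind, the sketch below uses the compound annual growth rate formula. Whether this matches the exact computation behind Table 1 is an assumption, and the function name is hypothetical.

```python
def annual_growth_rate(first_count, last_count, first_year, last_year):
    """Compound annual growth rate of yearly publication counts, in percent."""
    periods = last_year - first_year
    return ((last_count / first_count) ** (1 / periods) - 1) * 100
```

For example, a hypothetical series growing from 100 to 121 publications over two years corresponds to a rate of about 10% per year. Because the formula depends only on the first and last yearly counts, it is sensitive to an incomplete final year.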
The average document age of just 2.11 years indicates that the retrieved articles are relatively recent, which helps our study highlight current trends and developments in the domain. The average number of citations per document is nearly 18, suggesting that the documents are, on average, well cited within the field. The average number of citations per year per document is 4.152, reflecting sustained engagement with research at the intersection of these two areas.
In Figure 4, we showcase the increasing number of publications and the variability of average citations per year from the beginning of the period up to the current year. Starting around 2015, there is a noticeable rise in publications, with significant growth peaking in recent years, particularly from 2018. By 2023, the number of publications had nearly tripled compared with 2020, indicating the rising interest in this research area. In 2023 alone, nearly a quarter of the total articles were published. The average citations per year display a spike around 2013, suggesting the presence of possible high-impact publications during that time. The exact numbers for citations in recent years are unclear, as articles continue to be published. The simultaneous increase in both publications and citations in recent years, especially from 2018, suggests that recent studies are gaining more attention, likely due to the topic’s growing significance and the increased awareness as a result of the development of the SDGs [29].
Next, we examine which countries have produced the most articles in this domain, as well as the nature of their collaborative efforts. Table 2 lists the top 10 countries (based on the article’s corresponding author) with the highest number of published articles in this research field. It includes figures for publications solely from the author’s country and those involving international collaboration. A high MCP Ratio indicates that the majority of that country’s publications involve international collaboration. Overall, the USA, China, and Germany are the top contributors in terms of article volume. The United Kingdom and Australia have the highest collaboration rates, indicating the existence of global partnerships. With most ratios above 30%, it is evident that countries are increasingly working together to advance research in climate and ML, reflecting the global nature of these challenges. When the number of articles is computed separately for each country, regardless of the corresponding author’s origin, the top 30 countries based on the number of publications are the USA, China, Germany, the United Kingdom, Australia, Canada, Italy, Spain, India, France, the Netherlands, Republic of Korea, Switzerland, Sweden, Iran, Saudi Arabia, Japan, Austria, Denmark, Finland, Vietnam, Belgium, South Africa, Portugal, Poland, Egypt, and Greece. Figure 5 shows the differences in the number of publications around the globe, as well as the top 50 countries involved in researching this domain.
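The collaboration figures can be derived along the following lines. This is a sketch that assumes each paper is described by its corresponding author's country and the list of all author countries. A paper counts as a single-country publication (SCP) when all authors share one country and as a multiple-country publication (MCP) otherwise. The input layout is hypothetical.

```python
from collections import defaultdict


def collaboration_table(papers):
    """Count SCP and MCP papers per corresponding-author country.

    `papers` is an iterable of (corresponding_country, author_countries) pairs.
    """
    counts = defaultdict(lambda: {"SCP": 0, "MCP": 0})
    for corresponding_country, author_countries in papers:
        kind = "MCP" if len(set(author_countries)) > 1 else "SCP"
        counts[corresponding_country][kind] += 1
    table = {}
    for country, row in counts.items():
        total = row["SCP"] + row["MCP"]
        table[country] = {**row, "MCP_ratio": row["MCP"] / total}
    return table
```

An MCP ratio above 0.5 then corresponds to the case described above, in which most of a country's publications involve international collaboration.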
We employed a collaboration network highlighting the global partnerships in climate and ML research. The countries are divided into two main clusters, yet the multiple lines between them reveal a dense web of partnership connections, making it somewhat hard to follow. The first cluster is centered around China and includes many Asian countries, as well as European nations like Romania and Hungary, indicating robust collaboration. The second cluster is dominated by the USA, followed by the European countries, particularly by Germany and the United Kingdom. The USA and China collaborate in multiple articles, being strongly linked, underscoring their central roles in global research collaboration. Some countries, like Germany and the UK, seem to act as bridges, connecting different clusters and enhancing international cooperation.
Another vital aspect to consider when analyzing the dataset is the distribution of publications based on research areas. Figure 6 reveals the top 20 research areas, offering understanding into where climate and ML intersect. The treemap is dominated by environmental sciences and ecology, leading with 2900 publications, which constitutes more than 20% of the total. This is followed by geology, with 1440 publications, nearly half the number of the leading field. Meteorology and atmospheric sciences, remote sensing, and engineering are other notable fields with substantial contributions in the studied domain. Although there are fewer publications in fields like agriculture, forestry, and physics, they are essential to the broader research landscape. Using this map, we illustrate the diversity of research areas, containing areas from multiple broad categories such as life sciences and biomedicine, physical sciences, and technology, with a strong emphasis on environmental and earth sciences, highlighting the integration of technology and engineering.
To study the research areas further, we used a thematic map, whose methodology was inspired by the proposal of Cobo et al. [55]. Using this map, we group different keywords and plot them on two axes: centrality (the degree of relevance to the studied field, reflected in a theme’s links to other themes) and density (the degree of development of a theme, reflected in the strength of its internal links). The map is divided into four quadrants based on these axes. Figure 7 displays the thematic map derived from the keywords of the extracted articles. The Walktrap algorithm was used to cluster the keywords when constructing the map.
The first quadrant contains niche themes highlighting focused research efforts that, while specialized, have lower overall relevance to the central theme of climate–ML integration. They reflect the initiatives in addressing specific environmental and health (for humans and animals) challenges, with pollution, health, mortality grouped together, along with events, circulation, ocean. The second quadrant, in the top right, presents motor themes, with particular emphasis on forest, accuracy, and biodiversity, pivotal themes in advancing understanding and innovation. These themes showcase high centrality, being closely related to the topic of interest. The third quadrant, at bottom left, features emerging or declining themes, characterized by predictive algorithms like support vector machine and logistic regression, alongside the Geographic Information System (gis). These areas may be rising as new focal points within the field or declining as their applications and effectiveness are assessed. We recommend future exploration in this area. The last quadrant shows basic themes, with the foundational themes of climate change, prediction, and classification, which provide a groundwork for further exploration. Additionally, model, climate, and temperature represent core topics consistently applied across studies, serving as a reliable base for more innovative research. By identifying well-established, niche, and emerging areas, studying this map aids in strategically allocating resources and shaping future research agendas more effectively.
The Web of Science Core Collection includes a category of interest named “Sustainable Development Goals” (SDGs). We further analyze the articles gathered for our research to see how they align with various goals within the SDGs, where we notice some publications addressing multiple goals simultaneously. The most popular goal in terms of the number of publications is Goal 13, Climate Action, with 4700 publications, indicating the critical importance of addressing climate-related challenges through research. This is followed by Goal 15, Life On Land, which has nearly 2000 articles, and Goal 14, Life Below Water, with about 1700 publications, both of them accentuating the interconnection of environmental preservation and sustainable practices. Goal 6, Clean Water and Sanitation, has 1200 papers. Further notable goals include Goal 3, Good Health and Well Being, with 960 papers, and Goal 11, Sustainable Cities and Communities, with 900 papers. The findings from both the research areas and SDGs highlight the critical role of interdisciplinary research in promoting sustainable development. This interconnected approach is vital for achieving comprehensive and effective solutions in line with global sustainability targets.
For the last step in this study’s exploratory data analysis, we examined the main sources of publications to determine where researchers typically publish their studies of the intersection of climate and ML. In Figure 8, we see the most popular journals by the number of publications. Remote Sensing leads with 625 articles, followed by Sustainability with 186, Scientific Reports with 167, and Water with 163. The remaining journals have a similar number of publications, indicating that research in this domain is published across a variety of sources.

4.2. Factorial Map for Keywords Plus Using EFA

One EDA method for assessing similarities between keywords is to draw a factorial map, the visual result of an EFA process for reducing data dimensionality. The input for EFA in this study was the keyword co-occurrence matrix generated from the bibliometric data using the Bibliometrix package. In Figure 9, we present a factorial map of keyword similarities with four clusters, obtained using the Multiple Correspondence Analysis method in the same R package. The two dimensions explain 57% of the total data variance, capturing a good portion of the variability in the data. The first dimension captures the primary variance in the data, distinguishing themes by their most prominent differences, while the second dimension captures variance not explained by Dim 1, such as specificity or innovation within themes.
The cluster in the top left area is characterized by keywords such as simulation, neural networks, prediction, optimization, and systems. This highlights specialized areas of research where advanced analytics techniques are employed to address complex climate issues. The diversity of advanced analytics-related keywords suggests a robust range of methodologies being utilized to study climate change with the use of ML. In the top right quadrant, the cluster includes keywords like rainfall, variability, drought, precipitation, and weather. These terms are vital for understanding climate dynamics and mark significant environmental challenges related to water resources and climate variability. The bottom left cluster features fundamental algorithms, such as support vector machine, classification, and random forest. This suggests an ongoing exploration of these techniques and their applications in climate research, reflecting a commitment to enhancing analytical capabilities within the field. Finally, the cluster in the bottom right showcases foundational themes like climate change, emissions, impacts, management, forest, and water. These keywords are essential for understanding the broader implications of climate research. Overall, the factorial map sheds light on the diverse methodologies and key themes prevalent in climate and ML research, offering insights into where current efforts are concentrated and potential areas for further exploration.

4.3. Text Mining

Bibliometric datasets comprise extensive texts and strings that can be analyzed to gain deeper insights into research within the climate and ML domain. We use various text mining techniques and examine these texts, particularly Keywords Plus and abstracts, as outlined below.

4.3.1. Keywords Plus Word Cloud

One of the fields in bibliometric datasets is Keywords Plus, which are index terms automatically generated by the Web of Science Core Collection. These keywords are derived from the titles of the articles cited in each article’s bibliography. They differ from the authors’ keywords because they are generated automatically rather than chosen by the authors. We performed the initial analysis on Keywords Plus, as they offer a consistent way of expressing themes.
To be able to obtain the word cloud, some pre-processing was necessary for the Keywords Plus field. Initially, we exploded the data for easier manipulation, as there were multiple values contained in a single column. Missing values were filled with blank text. Upon detailed examination, we grouped similar terms due to differences like singular vs. plural forms or partial terms, such as models and model; artificial neural networks and artificial neural network; climate and climate change, since they refer to the same aspect; impacts and impact; algorithm and algorithms; river, basin, and river-basin. Only the top 20 most popular keywords were included to maintain the word cloud’s readability.
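The pre-processing above can be sketched as follows. The sketch assumes that the Keywords Plus field stores terms separated by semicolons; the mapping lists only the groupings named in the text, and the choice of canonical form for each group is an assumption.

```python
from collections import Counter

# Variant -> canonical term, limited to the groupings mentioned in the text.
CANONICAL_TERMS = {
    "models": "model",
    "artificial neural networks": "artificial neural network",
    "climate": "climate change",
    "impacts": "impact",
    "algorithms": "algorithm",
    "river": "river-basin",
    "basin": "river-basin",
}


def top_keywords(keyword_fields, top_n=20):
    """Split each field into terms, merge variants, and count occurrences."""
    counts = Counter()
    for field in keyword_fields:
        for raw_term in (field or "").split(";"):  # missing fields become blank text
            term = raw_term.strip().lower()
            if term:
                counts[CANONICAL_TERMS.get(term, term)] += 1
    return counts.most_common(top_n)
```

The resulting list of (term, count) pairs can be passed to any word cloud library as a frequency table.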
Figure 10 displays the resulting word cloud. The cloud is dominated by climate change, a key focus of the dataset. The term model indicates the frequent use of modeling techniques to understand complex environmental phenomena. Although machine learning is not explicitly visible, it manifests through various analyses, such as the popular artificial neural network and random forest, along with methods like regression, both simple and logistic; support vector machine; or advanced time series models, all among the top 20 most popular keywords. Prediction and classification highlight the analytical focus of these studies, indicating the importance of forecasting and categorizing environmental data. Terms like impact and variability demonstrate interest in understanding the effects of climate change and the fluctuations within environmental systems. Temperature and precipitation, vital to climate studies, reflect ongoing investigations into these important factors, alongside weather and air pollution. Hydrological terms such as river-basin and water emphasize the importance of water resource management in climate research. Overall, the word cloud effectively captures the diverse themes and sophisticated techniques explored in this interdisciplinary research area, accentuating the integration of technology and science to tackle environmental issues.

4.3.2. Knowledge Graph for Pairs of Keywords Plus

Keywords can also be analyzed based on their interactions to identify commonly co-occurring pairs and uncover patterns in how the keywords are typically grouped together. This analysis helps reveal thematic connections and areas of focus within the research. Following the pre-processing steps from the previous analysis, we introduced a further step, in which we created a dictionary of pairs and their occurrences. Among the top 20 most popular keyword pairs, Climate Change appeared in 13 pairs, indicating its dominant role. The most common pair, Climate Change and Model, was found in 302 articles, followed by Climate Change with Impact in 275 publications. Other notable pairs include Climate Change with Temperature, in 145 papers and Climate Change with Artificial Neural Networks in 141 papers. Vegetation also frequently co-occurs with climate-related topics, appearing 125 times, highlighting the significance of environmental studies.
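A dictionary of pairs and their occurrences can be built as in the following sketch, which assumes each article is represented by a list of already normalized keywords. Each unordered pair is counted at most once per article.

```python
from collections import Counter
from itertools import combinations


def count_keyword_pairs(keyword_lists):
    """Count in how many articles each unordered keyword pair co-occurs."""
    pair_counts = Counter()
    for keywords in keyword_lists:
        unique_keywords = sorted(set(keywords))  # sort so (a, b) and (b, a) coincide
        pair_counts.update(combinations(unique_keywords, 2))
    return pair_counts
```

Calling `most_common(20)` on the result yields the top 20 pairs, and the same counts can serve as edge weights when building a keyword graph.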
For a visual approach to studying the pairs, we developed a knowledge graph using Node2Vec. Node2Vec embeds the graph nodes and preserves the network neighborhoods, and it is used to represent the complex graph structure of the keyword pairs’ relationships. We also applied the K-Means clustering approach to the embedded nodes to partition the data into distinct clusters. The combination of Node2Vec and K-Means enables a detailed exploration of the network, providing a structured view of the data and enhancing the understanding of the underlying themes and connections in the climate and ML research.
This newly obtained graph is hard to read, but it contains four distinctive clusters. The decision to use four clusters is based on a visual inspection of different graphs, which indicated that four clusters effectively represent the underlying structure of the data without overfitting. Node2Vec was trained using specific parameters to optimize the embeddings for our dataset. These parameters included a random walk length set at 30, which allows for sufficient context to effectively capture node relationships. We initiated 200 random walks from each node to ensure the embeddings are robust. A smaller window size of 10 was chosen to focus on local dependencies. Additionally, a minimum number of appearances of 1 was set to have all occurrences taken into consideration for embedding.
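To illustrate the walk-generation step with the parameters reported above, the sketch below produces uniform random walks over an adjacency list. This is a simplification: Node2Vec biases its walks with return and in-out parameters that the text does not report, and uniform walks correspond to setting both to 1. Edge weights are ignored, and the function name and graph layout are assumptions.

```python
import random


def generate_walks(adjacency, walk_length=30, num_walks=200, seed=0):
    """Start `num_walks` uniform random walks of `walk_length` nodes from every node.

    `adjacency` maps each node to the list of its neighbours.
    """
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    walks = []
    for _ in range(num_walks):
        rng.shuffle(nodes)  # visit start nodes in a fresh order on each pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # dead end: keep the shorter walk
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

The walks would then be treated as sentences and passed to a skip-gram model with a window size of 10 and a minimum count of 1 to obtain the node embeddings.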
K-Means was fit using standard parameters, providing a straightforward approach to cluster the embeddings effectively. The combination of these parameters and the four-cluster structure provides a coherent representation of the data, facilitating further analysis and interpretation of the underlying themes.
The four distinct clusters contain keywords grouped based on their semantic similarities and interactions. The first cluster, consisting of nodes on the outer edge of the graph, includes keywords for models like advanced time-series methods and support vector machine, along with vegetation, forest, and land use. This cluster is named “Ecosystem analysis using ML”, as it addresses uncertainties and patterns in environmental studies, focusing on growth and development. The second cluster, also situated on the periphery, is named “Water management and predictive analytics”. It focuses on water-related dynamics using ML techniques (like random forest) to predict trends, assess performance, and manage risks associated with water resources. Moving toward the interior of the graph, the third cluster is called “Climate impact modeling and prediction”, reflecting a focus on regression and artificial neural networks to model the impacts of precipitation and temperature on various climate-related topics. The last cluster contains the central nodes that play pivotal roles in the network and is named “Climate change modeling and classification”. It is focused on the geographic context of China, as indicated by the centrality of the country’s node. This cluster captures techniques such as classification, modeling, and variability to analyze environmental impacts in specific areas. Overall, these clusters highlight the critical role of integrating diverse methodologies to address complex global challenges in climate science.

4.3.3. Bigrams Word Cloud from Abstracts Mining

The other field of interest for text mining techniques is the analysis of abstracts, comprehensive texts authored by researchers that contain diverse and intriguing information. For the initial analysis, we created a word cloud using bigrams extracted from these texts. Bigrams, or pairs of two words, effectively identify underlying themes by highlighting shared ideas among different publications. To extract these bigrams, all words in the abstracts were tokenized, retaining only alphabetic tokens. These words were then paired based on their proximity, and each pair’s occurrence was recorded in a dictionary for easy manipulation. Before plotting the word cloud, we conducted additional pre-processing to remove stop words, common terms, and duplicates. Interestingly, after removing these common words, we noticed that the term climate was not among the top 50 most popular bigrams. Climate-change, with the hyphen, was not considered a bigram in this part. This suggests that while the term itself may not frequently appear in bigrams, its implications, particularly regarding climate change, are omnipresent, as earlier findings confirmed.
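The extraction can be sketched as follows. The stop-word list is a small hypothetical sample, and the order of operations (pair adjacent alphabetic tokens first, then discard pairs containing a stop word) is one reading of the description above.

```python
import string
from collections import Counter

# Small illustrative stop-word sample; a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "on", "with"}


def count_bigrams(abstracts, stop_words=STOP_WORDS):
    """Count adjacent word pairs, keeping only purely alphabetic tokens."""
    counts = Counter()
    for text in abstracts:
        tokens = [token.strip(string.punctuation) for token in text.lower().split()]
        tokens = [token for token in tokens if token.isalpha()]
        for first, second in zip(tokens, tokens[1:]):
            if first not in stop_words and second not in stop_words:
                counts[(first, second)] += 1
    return counts
```

Because hyphenated tokens such as climate-change are not purely alphabetic, they are dropped before pairing, which is consistent with the observation that this term does not appear as a bigram.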
Figure 11 depicts the word cloud obtained after extracting bigrams from the abstracts, created for the top 50 most popular bigrams. There is significant emphasis on temporal data analysis (time series) and ML applications (ml models); these bigrams occupy the most space in the cloud. Most applications focus on metrics such as root error or error rmse to evaluate the employed model’s performance, focusing on their importance. The pair degrees c indicates the critical role of temperature studies in climate change research. Various models are mentioned, the most popular being rf model (Random Forest), support svm and hybrid model, along with ensemble model, signaling the use of diverse advanced approaches. Predictions and analyses are realized on different themes, such as extreme events, crop yield, and soil properties, suggesting interest in extreme weather events and agriculture-related studies. With this word cloud, we illustrate the methodological diversity and the interdisciplinary nature of the research, highlighting once again how ML techniques are applied to address climate change challenges across various domains.

4.3.4. LDA from Abstracts Mining

For the final analysis in text mining information from the abstracts, we implemented the Latent Dirichlet Allocation algorithm, a popular NLP method. Tasked with identifying topics within all the given texts, this unsupervised ML technique provides information of value, as its results may not be as evident as the results obtained in the research we have discussed so far.
Before presenting the results for the LDA algorithm, we identified the optimal number of topics by running the algorithm with varying numbers of topics on the same corpus (the abstracts, in our study’s case) and using the same dictionary. To refine this dataset, we eliminated the lower extremes by removing words with fewer than five appearances, along with common stop words. We did not remove words that appeared in more than 50% of publications. Figure 12 illustrates the coherence scores (a measure used in comparing LDA models) for different topic numbers ranging from 2 to 20. The scores were calculated using a library in Python for computing coherence scores for LDA, with the c_v measure. There is a positive trend, indicating that selecting more topics improves topic clarity and interpretability; the higher the coherence score, the better the topic quality. However, none of the models achieves a coherence score above 50%, indicating challenges in meaningful interpretation and the presence of overlapping themes that are difficult to distinguish. The highest coherence scores occur at 13 topics and between 18 and 20 topics.
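The vocabulary refinement can be sketched as below. It assumes that "appearances" refers to the total number of occurrences in the corpus; some libraries apply the threshold to the number of documents containing a word instead. The function name is hypothetical.

```python
from collections import Counter


def filter_rare_tokens(tokenized_docs, stop_words, min_count=5):
    """Drop stop words and tokens occurring fewer than `min_count` times overall."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    kept_vocabulary = {
        token
        for token, count in counts.items()
        if count >= min_count and token not in stop_words
    }
    return [[token for token in doc if token in kept_vocabulary] for doc in tokenized_docs]
```

The filtered documents would then be converted to a dictionary and a bag-of-words corpus, on which one LDA model per candidate number of topics (2 to 20) is trained and scored.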
The 13-topics model, despite its high coherence, resulted in numerous topics that were closely related and sometimes difficult to distinguish from each other. This overlap complicated the task of interpreting the results meaningfully, which led us to pick a smaller number of topics. The reduced number of topics allows for a more straightforward narrative in our analysis, making it easier for stakeholders to grasp the core themes without becoming lost in finer distinctions.
The number of topics chosen for the LDA model is a critical factor that significantly impacts the quality and interpretability of the results. In this case, we selected seven topics based on a combination of coherence scores and interpretive clarity, improving the ease of plotting and displaying the results. The selection of seven topics represents a strategic compromise between coherence and usability, which makes the seven-topics model the optimal choice for our analysis.
The seven topics obtained after running the LDA algorithm on the bag of words from the abstracts are plotted as word clouds in Figure 13. Each topic was given a descriptive name that aligns with the associated keywords. The first topic, “Hydrological and climate impacts”, includes terms like precipitation, water, river, lake, flood, rainwater, and region, suggesting the focus on water-related climate impacts and regional hydrological changes. The second topic, “Agricultural and soil observations”, features terms like yield, soil, crop, and vegetation, indicating existing and emerging research performed on agriculture and soil management. The third topic, “Species distribution and environmental changes”, has words like species, environmental, habitat, and plant, which point to biodiversity and ecological studies. Another topic, “Forest mapping and land classification”, contains the words for forest, tree, map, land, and classification, suggesting another area of interest for climate research, forest analysis and land use. The fifth topic, “Oceanic and atmospheric studies”, with terms like ocean, sea, cloud, and emission, presents a clear focus on atmospheric and oceanic research, emphasizing the interconnectedness of ecosystems. The sixth topic, “Predictive modeling and performance analysis”, contains algorithm, regression, rf, accuracy, and error, highlighting work in predictive analytics and ML. The last topic, “Methodological approaches and energy systems”, contains words like energy, system, network, analysis, and algorithm, reflecting a focus on methodological and systemic approaches across various fields. All these topics highlight central research areas within the intersection of climate and ML, reflecting diverse fields such as hydrology, agriculture, biodiversity, atmospheric studies, ML, and systems analysis.
Figure 14 presents the seven topics obtained after applying LDA on the abstracts, using an Intertopic Distance Map. The visualization was performed using the pyLDAvis [56] package in Python. The topics are numbered from 1 to 7, with visualization offering a spatial perspective on their relationships, an additional piece of information compared with the knowledge graph studied before.
Topic 1, “Methodological approaches and energy systems”, occupies the largest space, with 24% of the tokens, indicating recurring themes in the corpus related to methods, systems, research, and framework, along with themes like energy and power. Topic 2, “Predictive modeling and performance analysis”, is larger in size; it overlaps with topic 7, “Agricultural and soil observations”, suggesting shared themes, particularly in prediction analytics applied to agriculture. Topic 3, “Forest mapping and land classification”, is positioned the furthest from the other topics, highlighting niche themes in climate research, including forest and vegetation analysis. Topic 5, “Oceanic and atmospheric studies”, is the furthest from 3, but closer to topic 1 and reflects research related to biodiversity and oceanic diversity. Topic 4, “Hydrological and climate impacts”, tangent to topic 6, “Species distribution and environmental changes”, indicates a relationship between themes such as water-related climate impacts and species distribution. Figure 14 helps to better understand the thematic overlaps and distinctiveness among the topics, learning about the complex thematic structure of research at the intersection of climate science and ML.

5. Conclusions

Our study reveals the rapidly expanding intersection of climate change and ML research, characterized by a substantial annual growth of 37.39% in publications from 2004 to 2024. This signifies an increasing recognition of ML’s potential to address complex climate-related challenges. The field’s youth, highlighted by the initial publications in 2004, and its collaborative nature suggest a vibrant environment for innovation and knowledge sharing. The high average number of citations per document indicates significant engagement and relevance within the existing literature. However, the lack of highly cited or hot papers raises important questions about how this niche area is recognized within the broader research landscape. Further investigation is needed to understand this aspect. Our findings directly address the research questions posed in the introduction; RQ1 (the growth of ML in climate research) is answered by the growth figures described above.
For the second and third research questions, the countries with the most contributions and the nature of collaboration, we notice that international synergy plays a prominent role in the field, with leading contributions from the USA, China, and Germany, and high cooperation rates from countries like the United Kingdom and Australia. The diverse distribution of publications across various research areas, including environmental sciences, geology, meteorology, and remote sensing, shows the interdisciplinary approach needed to effectively address climate change. This reflects the complex nature of climate change issues and the multitude of approaches needed to better understand its underlying aspects.
The various data analysis methods (exploratory data analysis, bibliometrics, and text mining with topic modeling) we perform in this paper have provided useful information on the thematic structure of the literature. This addresses RQ4, concerning the advantages of methodological complementarity and RQ6, regarding the diverse approaches used in this domain. Tools like knowledge graphs and topic modeling (LDA) have highlighted key themes and connections, reinforcing the significant role of ML in climate research. Using LDA for text mining of abstracts, we answer RQ5 by identifying seven optimal topics, each revealing the key themes found at the intersection between climate research and ML: (1) methodological approaches and energy systems, (2) predictive modeling and performance analysis, (3) forest mapping and land classification, (4) agricultural and soil observations, (5) oceanic and atmospheric studies, (6) hydrological and climate impacts, and (7) species distribution and environmental changes. We also utilized a knowledge graph to visualize the results of text mining Keywords Plus pairs with Node2Vec and K-Means, revealing somewhat distinct clusters: ecosystem analysis using ML, water management and predictive analytics, climate impact modeling and prediction, and climate change modeling and classification. These findings collectively respond to our research questions, particularly RQ7, which addresses the domains connected to climate research utilizing machine learning methods, highlighting the breadth of applications across diverse fields.
While our study presents helpful information, certain limitations must be acknowledged. First, it relies solely on the Web of Science Core Collection database, which may introduce bias by excluding publications from other databases or less prominent journals. Restricting the sample to open-access documents might also exclude significant research that is not freely available, narrowing the analysis. Additionally, missing data for certain variables, such as “Highly Cited Status” and “Hot Paper Status”, may affect the reliability of our findings.
Moreover, the methodologies used also present certain challenges, especially the subjective nature of the topic modeling process, in which we chose the optimal number of topics, the stop words, and the other words to be excluded. The parameters used in all the models (Node2Vec, K-Means, LDA, factorial analysis, thematic map) can also influence the results.
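The sensitivity to the number of topics can be illustrated with a small coherence calculation. The sketch below uses the UMass coherence measure because it is easy to compute from document co-occurrence counts in plain Python; this section does not state which coherence measure underlies Figure 12, so the measure, the toy corpus, and the candidate topic sets are all assumptions made for illustration.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence: sum of log((D(w_i, w_j) + 1) / D(w_j)) over word pairs, j before i."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for j, i in combinations(range(len(top_words)), 2):
        w_j, w_i = top_words[j], top_words[i]
        # Assumes every top word occurs in at least one document.
        score += math.log((doc_freq(w_i, w_j) + 1) / doc_freq(w_j))
    return score


# Toy corpus of tokenized abstracts (hypothetical).
docs = [
    ["forest", "mapping", "land", "classification"],
    ["forest", "land", "cover"],
    ["ocean", "temperature", "atmosphere"],
    ["ocean", "atmosphere", "forecast"],
]

# Hypothetical top words per topic for two candidate numbers of topics.
candidate_models = {
    2: [["forest", "land"], ["ocean", "atmosphere"]],
    3: [["forest", "ocean"], ["land", "atmosphere"], ["cover", "forecast"]],
}

mean_coherence = {
    k: sum(umass_coherence(topic, docs) for topic in topics) / len(topics)
    for k, topics in candidate_models.items()
}
best_k = max(mean_coherence, key=mean_coherence.get)
print(mean_coherence, best_k)
```

Changing the stop-word list or the vocabulary alters the co-occurrence counts and can shift which number of topics scores best, which is the subjectivity noted above.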
Lastly, our study primarily uses quantitative data, and incorporating qualitative data could provide a richer understanding of the research context. Addressing these limitations would require expanding the scope of the research, using multiple databases, incorporating qualitative data, and adopting more elaborate analytical approaches. Future studies could address these points to further refine the understanding of the relationships between climate change and ML research.
In conclusion, our study demonstrates that machine learning is not just a supplementary tool, but an essential and increasingly dominant component of climate change research. By identifying key themes, methodologies, and collaborative patterns, this research contributes to a deeper understanding of this dynamic field and informs strategies for future research and resource allocation. As the field continues to evolve, integrating diverse methodologies and data sources is essential to properly understanding the complex interactions between climate change and ML [18].

Author Contributions

Conceptualization, A.-M.N.; Methodology, A.-M.N.; Software, A.-G.V.; Validation, A.-M.N., S.-V.O., A.B. and A.-I.A.; Formal analysis, A.-M.N.; Investigation, A.-I.A.; Resources, A.-G.V.; Data curation, A.-G.V.; Writing—original draft, A.-M.N. and A.-G.V.; Writing—review & editing, S.-V.O., A.B. and A.-I.A.; Visualization, A.-I.A.; Supervision, S.-V.O. and A.B.; Project administration, A.B.; Funding acquisition, S.-V.O. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by a grant of the Ministry of Research, Innovation and Digitization, CNCS/CCCDI-UEFISCDI, project number ERANET-ERAMIN-3-ValorWaste-1, within PNCDI IV.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare that there is no conflict of interest.

References

  1. Papadopoulos, T.; Balta, M. Climate Change and big data analytics: Challenges and opportunities. Int. J. Inf. Manag. 2022, 63, 102448.
  2. Abbass, K.; Qasim, M.Z.; Song, H.; Murshed, M.; Mahmood, H.; Younis, I. A review of the global climate change impacts, adaptation, and sustainable mitigation measures. Environ. Sci. Pollut. Res. 2022, 29, 42539–42559.
  3. Scott, D. Sustainable Tourism and the Grand Challenge of Climate Change. Sustainability 2021, 13, 1966.
  4. IPCC. Special Report on Global Warming of 1.5 °C. IPCC Spec. Rep. Impacts Glob. Warm. 2018, 1, 93–174.
  5. IPCC. Climate Change and Land: An IPCC Special Report on Climate Change, Desertification, Land Degradation, Sustainable Land Management, Food Security, and Greenhouse Gas Fluxes in Terrestrial Ecosystems; IPCC: Geneva, Switzerland, 2019.
  6. Take Climate Action in Your Community. The White House, 27 January 2021. Available online: https://www.whitehouse.gov/climate/ (accessed on 12 October 2024).
  7. China’s Policies and Actions for Addressing Climate Change; Ministry of Ecology and Environment of the People’s Republic of China: Beijing, China, 2022.
  8. Burnett, N.; Stewart, I.; Hinson, S.; Tyers, R.; Hutton, G.; Malik, X. The UK’s Plans and Progress to Reach Net Zero by 2050; House of Commons Library: London, UK, 2024.
  9. Ukoba, K.; Onisuru, O.R.; Jen, T.-C. Harnessing machine learning for sustainable futures: Advancements in renewable energy and climate change mitigation. Bull. Natl. Res. Cent. 2024, 48, 99.
  10. Materia, S.; García, L.P.; van Straaten, C.O.S.; Mamalakis, A.; Cavicchia, L.; Coumou, D.; de Luca, P.; Kretschmer, M.; Donat, M. Artificial intelligence for climate prediction of extremes: State of the art, challenges, and future perspectives. WIREs Clim. Change 2024, 15, e914.
  11. Mukherjee, D.; Lim, W.M.; Kumar, S.; Donthu, N. Guidelines for advancing theory and practice through bibliometric research. J. Bus. Res. 2022, 148, 101–115.
  12. Haunschild, R.; Bornmann, L.; Marx, W. Climate Change Research in View of Bibliometrics. PLoS ONE 2016, 11, e0160393.
  13. Grieneisen, M.; Zhang, M. The current status of climate change research. Nat. Clim. Change 2011, 2, 72–73.
  14. Li, J.; Wang, M.-H.; Ho, Y.-S. Trends in research on global climate change: A Science Citation Index Expanded-based analysis. Glob. Planet. Change 2011, 77, 13–20.
  15. Rolnick, D.; Donti, P.L.; Kaack, L.H.; Kochanski, K.; Lacoste, A.; Sankaran, K.; Ross, A.S.; Milojevic-Dupont, N.; Jaques, N.; Waldman-Brown, A.; et al. Tackling Climate Change with Machine Learning. ACM Comput. Surv. 2022, 55, 1–96.
  16. Watson-Parris, D. Machine learning for weather and climate are worlds apart. Philos. Trans. R. Soc. A: Math. Phys. Eng. Sci. 2021, 379, 20200098.
  17. Zennaro, F.; Furlan, E.; Simeoni, C.; Torresan, S.; Aslan, S.; Critto, A.; Marcomini, A. Exploring machine learning potential for climate change risk assessment. Earth-Sci. Rev. 2021, 220, 103752.
  18. Ardabili, S.; Mosavi, A.; Dehghani, M.; Várkonyi-Kóczy, A. Deep Learning and Machine Learning in Hydrological Processes Climate Change and Earth Systems a Systematic Review. In Engineering for Sustainable Future. INTER-ACADEMIA 2019. Lecture Notes in Networks and Systems; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; Volume 101.
  19. Chen, L.; Han, B.; Wang, X.; Zhao, J.; Yang, W.; Yang, Z. Machine Learning Methods in Weather and Climate Applications: A Survey. Appl. Sci. 2023, 13, 12019.
  20. Uthirapathy, S.; Sandanam, D. Topic Modelling and Opinion Analysis on Climate Change Twitter Data Using LDA and BERT Model. Procedia Comput. Sci. 2022, 218, 908–917.
  21. Wu, M.; Long, R.; Chen, F.; Chen, H.; Bai, Y.; Cheng, K.; Huang, H. Spatio-temporal difference analysis in climate change topics and sentiment orientation: Based on LDA and BiLSTM model. Resour. Conserv. Recycl. 2023, 188, 106697.
  22. Ejaz, W.; Ittefaq, M.; Jamil, S. Politics triumphs: A topic modeling approach for analyzing news media coverage of climate change in Pakistan. J. Sci. Commun. 2023, 22, A02.
  23. Văduva, A.-G.; Munteanu, M.; Oprea, S.-V.; Bâra, A.; Niculae, A.-M. Understanding Climate Change and Air Quality over the Last Decade: Evidence from News and Weather Data Processing. IEEE Access 2023, 11, 144631–144648.
  24. Deo, K.; Prasad, A. Exploring Climate Change Adaptation, Mitigation and Marketing Connections. Sustainability 2022, 14, 4255.
  25. Zhang, Y.; Tao, J.; Wang, J.; Ding, L.; Ding, C.; Li, Y.; Zhou, Q.; Li, D.; Zhang, H. Trends in Diatom Research Since 1991 Based on Topic Modeling. Microorganisms 2019, 7, 213.
  26. Zou, T.; Guo, P.; Li, F.; Wu, Q. Research topic identification and trend prediction of China’s energy policy: A combined LDA-ARIMA approach. Renew. Energy 2024, 220, 119619.
  27. Dayeen, F.R.; Sharma, A.S.; Derrible, S. A text mining analysis of the climate change literature in industrial ecology. J. Ind. Ecol. 2020, 24, 276–284.
  28. Zhao, Y.; Zhang, Y.; Guo, J.; Wang, J.; Li, Y. Shifts in periphyton research themes over the past three decades. Environ. Sci. Pollut. Res. 2023, 30, 5281–5295.
  29. Sharifi, A.; Simangan, D.; Kaneko, S. Three decades of research on climate change and peace: A bibliometrics analysis. Sustain. Sci. 2020, 16, 1079–1095.
  30. AR4 Climate Change 2007: Synthesis Report; IPCC: Geneva, Switzerland, 2007.
  31. Birkle, C.; Pendlebury, D.; Schnell, J.; Adams, J. Web of Science as a data source for research on scientific and scholarly activity. Quant. Sci. Stud. 2020, 1, 363–376.
  32. Tennant, J.P.; Waldner, F.; Jacques, D.C.; Masuzzo, P.; Collister, L.B.; Hartgerink, C.H.J. The academic, economic and societal impacts of Open Access: An evidence-based review. F1000Research 2016, 5, 1–57.
  33. Osborne, J.W. Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data; SAGE Publications, Inc.: New York, NY, USA, 2012.
  34. Baillie, M.; Cessie, S.L.; Schmidt, C.O.; Lusa, L.; Huebner, M. Ten simple rules for initial data analysis. PLoS Comput. Biol. 2022, 18, e1009819.
  35. Komorowski, M.; Marshall, D.C.; Salciccioli, J.D.; Crutain, Y. Exploratory Data Analysis. In Secondary Analysis of Electronic Health Records, MIT Critical Data; Springer: Berlin/Heidelberg, Germany, 2016; pp. 185–203.
  36. Unwin, A. Exploratory Data Analysis. In International Encyclopedia of Education, 3rd ed.; Elsevier: Amsterdam, The Netherlands, 2010; pp. 156–161.
  37. Exploratory Data Analysis and Data Envelopment Analysis of Construction and Demolition Waste Management in the European Economic Area. Sustainability 2020, 12, 4995.
  38. Yong, A.G.; Pearce, S. A Beginner’s Guide to Factor Analysis: Focusing on Exploratory Factor Analysis. Tutor. Quant. Methods Psychol. 2013, 9, 79–94.
  39. Lee, S.-Y. Handbook of Latent Variable and Related Models; Elsevier: Amsterdam, The Netherlands, 2007.
  40. Hair, J.F.; Anderson, R.E.; Black, W.C. Multivariate Data Analysis, 7th ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
  41. Yang, S.; Yuan, Q.; Dong, J. Are Scientometrics, Informetrics, and Bibliometrics Different? Data Sci. Informetr. 2020, 1, 103597.
  42. Aria, M.; Cuccurullo, C. Bibliometrix: An R-tool for comprehensive science mapping analysis. J. Informetr. 2017, 11, 959–975.
  43. Glänzel, W. Bibliometrics as a Research Field: A Course on Theory and Application of Bibliometric Indicators; Course Handouts, 2003. Available online: https://www.researchgate.net/publication/242406991_Bibliometrics_as_a_research_field_A_course_on_theory_and_application_of_bibliometric_indicators (accessed on 1 November 2024).
  44. Fensel, D.; Şimşek, U.; Angele, K.; Huaman, E.; Kärle, E.; Panasiuk, O.; Toma, I.; Umbrich, J.; Wahler, A. What Is A Knowledge Graph? In Knowledge Graphs: Methodology, Tools and Selected Use Cases; Springer: Berlin/Heidelberg, Germany, 2020; pp. 1–10.
  45. Grohe, M. word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data. In PODS’20: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems; ACM: New York, NY, USA, 2020.
  46. Ahmed, M.; Seraj, R.; Islam, S.M.S. The k-means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295.
  47. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
  48. Griffiths, T.L.; Steyvers, M. Finding scientific topics. Proc. Natl. Acad. Sci. USA 2004, 101, 5228–5235.
  49. Jelodar, H.; Wang, Y.; Yuan, C.; Feng, X.; Jiang, X.; Li, Y.; Zhao, L. Latent Dirichlet Allocation (LDA) and Topic modeling: Models, applications, a survey. Multimed. Tools Appl. 2019, 78, 15169–15211.
  50. Al Sailawi, A.; Kangavari, M. Analyzing the Use of Social Media Data to Understand Long-Term Crisis Management Challenges of COVID-19. Fusion Pract. Appl. 2023, 14, 227–243.
  51. O’Callaghan, D.; Greene, D.; Carthy, J.; Cunningham, P. An analysis of the coherence of descriptors in topic modeling. Expert Syst. Appl. 2015, 42, 5645–5657.
  52. Röder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures. In Proceedings of the WSDM 2015—Proceedings of the 8th ACM International Conference on Web Search and Data Mining, Shanghai, China, 2–6 February 2015.
  53. Perga, M.-E.; Sarrasin, O.; Steinberger, J.; Lane, S.N.; Butera, F. The climate change research that makes the front page: Is it fit to engage societal action? Glob. Environ. Change 2023, 80, 102675.
  54. Mobility 2030: Meeting the Challenges to Sustainability; World Business Council for Sustainable Development (WBCSD): Chicago, IL, USA, 2004.
  55. Cobo, M.; López-Herrera, A.; Herrera-Viedma, E.; Herrera, F. Science mapping software tools: Review, analysis, and cooperative study among tools. J. Am. Soc. Inf. Sci. Technol. 2011, 62, 1382–1402.
  56. Sievert, C.; Shirley, K. LDAvis: A method for visualizing and interpreting topics. In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, Baltimore, MD, USA, 27 June 2014; pp. 63–70.
Figure 1. Research flow methodology.
Figure 2. EDA steps, inspired by [37].
Figure 3. Bibliometric techniques.
Figure 4. Evolution of the number of publications and the average citations per year from 2004 to 2024.
Figure 5. Top 50 countries based on their number of published articles in climate research and ML.
Figure 6. Most popular 20 research areas by number of publications in the climate and ML research.
Figure 7. Thematic map of grouped Keywords Plus for the climate–ML publications.
Figure 8. Top 20 most popular journals by number of publications containing articles about climate research and ML.
Figure 9. Factorial map of Keywords Plus for climate–ML publications.
Figure 10. Word cloud from the Keywords Plus field; climate and ML research.
Figure 11. Word cloud from abstract bigrams; climate and ML research.
Figure 12. Coherence score for LDA vs. number of topics.
Figure 13. LDA topics as word clouds.
Figure 14. LDA—topic visualization using pyLDAvis.
Table 1. Overview of bibliometrics on climate research and ML.

| General Topic | Values |
|---|---|
| Timespan | 2004–2024 |
| Sources (journals, books, etc.) | 1356 |
| Documents | 7521 |
| Annual growth rate % | 37.39 |
| Document average age | 2.11 |
| Average citations per document | 17.87 |
| Average citations per year per document | 4.152 |
| Keywords Plus | 11,345 |
| Authors’ keywords | 17,103 |
| Author appearances | 45,767 |
| Authors of single-authored documents | 142 |
| Single-authored documents | 149 |
| Documents per author | 0.231 |
| Co-authors per document | 6.09 |
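As a reading aid for the annual growth rate in Table 1, the sketch below computes a compound annual growth rate from the first and last yearly publication counts. This assumes the reported figure is a compound rate, which is a common convention for such indicators; the function name `annual_growth_rate` and the example counts are hypothetical, since the yearly counts are not listed in this section.

```python
def annual_growth_rate(first_count, last_count, n_years):
    """Compound annual growth rate in percent over n_years yearly observations."""
    if first_count <= 0 or n_years < 2:
        raise ValueError("need a positive first count and at least two years")
    # n_years observations span n_years - 1 growth intervals.
    return ((last_count / first_count) ** (1 / (n_years - 1)) - 1) * 100


# Hypothetical counts: 10 papers in the first year, 160 in the fifth year.
example_rate = annual_growth_rate(10, 160, 5)
print(round(example_rate, 2))

# What a sustained 37.39% rate implies over 20 yearly intervals (2004 to 2024),
# expressed as a multiplication factor of the yearly output.
implied_factor = (1 + 37.39 / 100) ** 20
print(round(implied_factor))
```

A compound rate depends only on the first and last years, so it is sensitive to a small starting count and to an incomplete final year.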
Table 2. Overview of corresponding authors’ countries.

| COUNTRY | ARTICLES | FREQUENCY | SCP | MCP | MCP_RATIO |
|---|---|---|---|---|---|
| USA | 1482 | 0.1983 | 1028 | 454 | 0.306 |
| China | 1389 | 0.1859 | 893 | 496 | 0.357 |
| Germany | 621 | 0.0831 | 343 | 278 | 0.448 |
| United Kingdom | 341 | 0.0456 | 128 | 213 | 0.625 |
| Australia | 285 | 0.0381 | 118 | 167 | 0.586 |
| Canada | 258 | 0.0345 | 154 | 104 | 0.403 |
| Italy | 230 | 0.0308 | 129 | 101 | 0.439 |
| Republic of Korea | 213 | 0.0285 | 133 | 80 | 0.376 |
| Spain | 213 | 0.0285 | 126 | 87 | 0.408 |
| India | 205 | 0.0274 | 141 | 64 | 0.312 |

SCP: single-country publication; MCP: multiple-country publication.
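The MCP_RATIO column can be checked directly from the other columns. The short sketch below uses the values from Table 2 and assumes that MCP_RATIO is MCP divided by ARTICLES, which the reported figures are consistent with; the variable names are our own.

```python
# (country, articles, SCP, MCP, reported MCP_RATIO), copied from Table 2.
rows = [
    ("USA", 1482, 1028, 454, 0.306),
    ("China", 1389, 893, 496, 0.357),
    ("Germany", 621, 343, 278, 0.448),
    ("United Kingdom", 341, 128, 213, 0.625),
    ("Australia", 285, 118, 167, 0.586),
    ("Canada", 258, 154, 104, 0.403),
    ("Italy", 230, 129, 101, 0.439),
    ("Republic of Korea", 213, 133, 80, 0.376),
    ("Spain", 213, 126, 87, 0.408),
    ("India", 205, 141, 64, 0.312),
]

computed = {}
for country, articles, scp, mcp, reported in rows:
    # Single- and multiple-country publications should add up to the total.
    assert scp + mcp == articles, country
    computed[country] = round(mcp / articles, 3)

# Rows where the recomputed ratio differs from the reported one.
mismatches = [c for c, _, _, _, reported in rows if computed[c] != reported]

# Country with the largest share of internationally co-authored papers.
most_collaborative = max(computed, key=computed.get)
print(mismatches, most_collaborative)
```

The ratio separates volume from collaboration intensity: the countries with the most articles are not the ones with the highest share of multiple-country publications.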
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Niculae, A.-M.; Oprea, S.-V.; Văduva, A.-G.; Bâra, A.; Andreescu, A.-I. Assessing the Role of Machine Learning in Climate Research Publications. Sustainability 2024, 16, 11086. https://doi.org/10.3390/su162411086

