4.1 Form-based Approaches
According to Table
3, we note that most form-based approaches are time-oblivious. A few time-aware approaches have recently appeared, and they are all characterized by the adoption of a specific fine-tuning operation to inject time information into the model. All the current work leverages unsupervised learning modalities with the exception of Reference [
6]. The aggregation stage is mostly based on averaging, while clustering is only enforced in Reference [
12], where a cluster represents the dominant sense of the word
w. In particular, in Reference [
12], a word is considered as changing when clustering the embeddings
\(\Phi _1\) and
\(\Phi _2\) via K-means with
\(k=2\) generates two groups where one of the two clusters contains at least 90% of the embeddings from one corpus only (i.e.,
\(C_1\) or
\(C_2\)).
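The decision rule of Reference [12] can be illustrated with a minimal sketch. It assumes that cluster labels have already been produced by K-means with \(k=2\) and that each embedding carries the identifier of its corpus. The function name `is_changed`, its arguments, and the reading of the 90% threshold as the share of a cluster's members that come from a single corpus are assumptions of this example, not details confirmed by the original description.

```python
from collections import Counter


def is_changed(labels, corpus_ids, threshold=0.9):
    """Flag a word as changed if some cluster is dominated by one corpus.

    labels:     cluster label of each embedding (e.g., from K-means with k=2)
    corpus_ids: corpus of origin of each embedding (e.g., 1 for C_1, 2 for C_2)
    threshold:  minimum share of a cluster coming from a single corpus
    """
    for cluster in set(labels):
        members = [c for l, c in zip(labels, corpus_ids) if l == cluster]
        # Size of the largest single-corpus group inside this cluster.
        top_count = Counter(members).most_common(1)[0][1]
        if top_count / len(members) >= threshold:
            return True
    return False


# Toy example: cluster 0 only holds embeddings from C_1, cluster 1 only from C_2.
separated = is_changed([0] * 10 + [1] * 10, [1] * 10 + [2] * 10)
# Toy example: both clusters mix the two corpora evenly.
mixed = is_changed([0] * 10 + [1] * 10, [1, 2] * 10)
```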
In form-based approaches, the following change functions are proposed for measuring the semantic change s:
Cosine distance (CD). The change
s is measured as the
cosine distance (CD) between the word prototypes
\(\mu _1, \mu _2\) as follows:
\(CD(\mu _1, \mu _2) = 1 - CS(\mu _1, \mu _2),\)
where
CS is the
cosine similarity between the prototypes. Intuitively, the greater the
\(CD(\mu _1, \mu _2)\), the greater the change in the dominant sense of
w.
Typically, the prototypes
\(\mu _1\) and
\(\mu _2\) are determined through aggregation by averaging over
\(\Phi _1\) and
\(\Phi _2\), respectively (e.g., Reference [
91]). By contrast, in Reference [
57], the prototype embedding
\(\mu _2\) at timestep
\(t=2\) is computed by updating the prototype embedding
\(\mu _1\) at timestep
\(t=1\) through a weighted running average (e.g., Reference [
38]).
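As a minimal sketch of the averaging-based variant, the prototypes can be computed as the mean of the contextualized embeddings and compared through the cosine distance. The function names and the toy two-dimensional embeddings are assumptions of this example; the weighted running average of Reference [57] is not covered.

```python
import numpy as np


def prototype(embeddings):
    """Average a set of contextualized embeddings into a word prototype."""
    return np.asarray(embeddings, dtype=float).mean(axis=0)


def cosine_distance(u, v):
    """Cosine distance, i.e., one minus the cosine similarity."""
    cs = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(1.0 - cs)


# Toy embeddings of the same word in two corpora (illustrative values only).
phi_1 = np.array([[1.0, 0.0], [1.0, 0.2]])
phi_2 = np.array([[0.0, 1.0], [0.2, 1.0]])

mu_1, mu_2 = prototype(phi_1), prototype(phi_2)
s = cosine_distance(mu_1, mu_2)
```

A value of `s` close to 0 indicates that the two prototypes point in nearly the same direction, while larger values indicate a stronger shift of the dominant sense.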
In Reference [
91], the CD metric is employed in a multilingual experiment where the change is measured across a diachronic corpus with texts of different languages. This is the only example of cross-language change detection.
CD is also used in time-aware approaches. The integration of extra-linguistic information into word embeddings, such as time and social space, has been proposed in previous work based on static LMs [
121,
144]. Recently, this integration has also been applied to contextualized embeddings [
60,
119]. In Reference [
56], a pre-trained LLM is fine-tuned to encapsulate time and social space in the generated embeddings. Then, the change
s is assessed by computing the CD between embeddings generated by the original pre-trained model and the embeddings generated by the time-aware, fine-tuned model. In particular, in Reference [
145], a
temporal referencing mechanism is adopted to encode time-awareness into a pre-trained model. Temporal referencing is a pre-processing step of the documents that tags each occurrence of
w in
\(C_1\) and
\(C_2\) with a special marker denoting the corpus/time in which it appears [
34,
37]. The embeddings of a tagged word are learned by fine-tuning the LLM for domain-adaptation. In this case,
s is assessed by computing the CD between
\(\mu _{[1]}\) and
\(\mu _{[2]}\), where
\([i]\) denotes
w with the temporal marker
\(t_i\). Similarly to Reference [
145], a time-aware approach is proposed in Reference [
116] where a time marker is added to documents instead of words and the LLM is fine-tuned to predict the injected time information (i.e., time masking). This way, there is no need to add a tag for each target word and its various forms (e.g., singular, plural), thereby avoiding the inclusion of additional new tokens in the LLM’s vocabulary. As an alternative, in Reference [
117], a
temporal attention mechanism is adopted to generate the embeddings
\(\Phi _1\) and
\(\Phi _2\) for calculating CD.
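The pre-processing step behind temporal referencing can be sketched as follows. The marker format (`word_t1`) and the helper `tag_target` are hypothetical choices made for illustration; the cited works may adopt a different tagging scheme.

```python
import re


def tag_target(text, target, marker):
    """Append a time marker to every whole-word occurrence of `target`."""
    pattern = re.compile(rf"\b{re.escape(target)}\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: f"{m.group(0)}_{marker}", text)


doc_c1 = "The plane landed. Planes are fast."
tagged = tag_target(doc_c1, "plane", "t1")
```

In this sketch, the plural form "Planes" is left untagged, which illustrates why word-level tagging needs one tag per word form, whereas document-level time markers do not.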
Inverted similarity over word prototypes (PRT). This measure is proposed as an alternative to CD for improving the effectiveness of the change detection [
73]. The
inverted similarity over word prototypes (PRT) measure is defined as:
\(PRT(\mu _1, \mu _2) = \frac{1}{CS(\mu _1, \mu _2)}.\)
Time-diff (TD). This measure is designed for time-aware approaches, and it analyzes how the polysemy of a word changes over time. It is based on the model’s capability to predict the time of a document, and it calculates the change
s by considering the probability distribution of the predicted times [
116]. Intuitively, a uniform distribution means that the document-time association is not strong enough to clearly entail a change, whereas a non-uniform distribution means that there is evidence to predict the time of a document. Given a document d, let \(p_j(d)\) be the probability that d belongs to the time \(t_j\). The function
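time diff (TD) is defined as the average difference of the predicted time probabilities:
\(TD(w) = \frac{1}{|D|} \sum _{d \in D} |p_1(d) - p_2(d)|,\)
where D denotes the set of documents considered for w.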
The experiments conducted in Reference [
116] demonstrate that TD outperforms CD on short-term semantic change when their performance is compared on the task of Graded Change Detection across various benchmarks. Conversely, CD outperforms TD on long-term semantic change. Reference [
116] argues that TD is less effective on long-term periods, since major differences in writing style emerge and the prediction of document-time associations is less reliable.
Average pairwise distance (APD). This measure exploits the variance of the contextualized representations
\(\Phi _1\),
\(\Phi _2\) to assess the semantic change (i.e., variance in the word’s polysemy). Unlike the previous measures, APD works directly on word embeddings without requiring any aggregation stage, namely, neither clustering nor averaging. The
average pairwise distance (APD) is defined as follows:
\(APD(\Phi _1, \Phi _2) = \frac{1}{|\Phi _1| \cdot |\Phi _2|} \sum _{e_1 \in \Phi _1,\ e_2 \in \Phi _2} d(e_1, e_2),\)
where
d is an arbitrary distance measure (e.g., cosine distance, Euclidean distance, Canberra distance). According to the experiments performed in Reference [
45], APD performs better when the Euclidean distance is employed as
d. In Reference [
67], APD is used over the embeddings
\(\Phi _1\) and
\(\Phi _2\) by applying a dimensionality reduction through the
Principal Component Analysis (PCA). In Reference [
67], experiments on both slang and non-slang words are performed through causal analysis to study how distributional factors (e.g., polysemy, frequency shift) influence the change
s. The results show that slang words experience less semantic change than non-slang words.
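A minimal sketch of APD is given below, assuming that the embeddings are stored as NumPy arrays with one row per word occurrence. The function name `apd` and the `metric` argument are assumptions of this example, and only the cosine and Euclidean distances are covered.

```python
import numpy as np


def apd(phi_1, phi_2, metric="cosine"):
    """Average pairwise distance between two sets of embeddings."""
    a = np.asarray(phi_1, dtype=float)
    b = np.asarray(phi_2, dtype=float)
    if metric == "cosine":
        a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
        b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
        dist = 1.0 - a_n @ b_n.T  # all |phi_1| x |phi_2| cosine distances
    elif metric == "euclidean":
        dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    else:
        raise ValueError(f"unsupported metric: {metric}")
    return float(dist.mean())


phi_1 = [[1.0, 0.0], [0.0, 1.0]]
phi_2 = [[1.0, 0.0]]
s_cos = apd(phi_1, phi_2, metric="cosine")
s_euc = apd(phi_1, phi_2, metric="euclidean")
```

Since every pair of embeddings is compared, the cost grows with the product of the two set sizes, which may be relevant for frequent words.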
In Reference [
70], lexical substitutes are used to assess
s. A set of lexical substitutes is generated by leveraging a masked LLM (e.g., XLM-R), and the word representations \(\Phi _1\) and \(\Phi _2\) are computed as
bag-of-substitutes. APD is then computed over \(\Phi _1\) and \(\Phi _2\) to assess
s.
APD is also used in a time-aware approach described in Reference [
110], where a pre-trained BERT model is fine-tuned to predict the time period of a sentence. APD is finally used to measure the change between the embeddings extracted from the fine-tuned LLM.
In Reference [
6], APD is employed to measure the change
s over the embeddings
\(\Phi _1\) and
\(\Phi _2\) extracted from a supervised
Word-in-Context model (WiC) [
109]. This LLM is trained to reproduce the behavior of human annotators when they are asked to evaluate the similarity of the meaning of a word
w in a pair of given sentences from
\(C_1\) and
\(C_2\), respectively. The embeddings
\(\Phi _1\) and
\(\Phi _2\) are extracted from the trained WiC model for calculating the final APD measure.
Average of average inner distances (APD-OLD/NEW). The APD-OLD/NEW measure is presented in Reference [
81] as an extension of APD, and it estimates the change
s as the average degree of polysemy of
w in the corpora
\(C_1\) and
\(C_2\), respectively. The
average of average inner distances (APD-OLD/NEW) is defined as:
\(\text{APD-OLD/NEW}(\Phi _1, \Phi _2) = \frac{AID(\Phi _1) + AID(\Phi _2)}{2},\)
where AID is the
average inner distance, and it measures the degree of polysemy of
w in a specific time frame by relying on the APD measure, namely,
\(AID(\Phi _1) = APD(\Phi _1, \Phi _1)\) and
\(AID(\Phi _2) = APD(\Phi _2, \Phi _2)\), respectively.
Hausdorff distance (HD). The change
s is measured as the
Hausdorff distance (HD) between the word embeddings
\(\Phi _1\) and
\(\Phi _2\). Similarly to APD, HD directly works on word embeddings without requiring any aggregation stage. HD relies on the Euclidean distance
d to measure the difference between the embeddings of
w in
\(C_1\) and
\(C_2\), and it returns the greatest of all the distances
d from one embedding
\(e_1 \in \Phi _1\) to the closest embedding
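\(e_2 \in \Phi _2\) or vice versa. The HD measure is defined as follows:
\(HD(\Phi _1, \Phi _2) = \max \left( \sup _{e_1 \in \Phi _1} \inf _{e_2 \in \Phi _2} d(e_1, e_2),\ \sup _{e_2 \in \Phi _2} \inf _{e_1 \in \Phi _1} d(e_1, e_2) \right).\)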
The experiments performed in Reference [
138] show that HD is sensitive to outliers, since it is based on infimum and supremum, thus an outlier embedding may largely affect the final
s value.
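The sensitivity of HD to outliers can be reproduced with a small sketch based on the Euclidean distance. The function name `hausdorff` and the toy embeddings are assumptions of this example.

```python
import numpy as np


def hausdorff(phi_1, phi_2):
    """Symmetric Hausdorff distance between two finite sets of embeddings."""
    a = np.asarray(phi_1, dtype=float)
    b = np.asarray(phi_2, dtype=float)
    # Pairwise Euclidean distances, shape (|phi_1|, |phi_2|).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    forward = dist.min(axis=1).max()   # farthest e_1 from its closest e_2
    backward = dist.min(axis=0).max()  # farthest e_2 from its closest e_1
    return float(max(forward, backward))


phi_1 = [[0.0, 0.0], [1.0, 0.0]]
without_outlier = hausdorff(phi_1, [[0.0, 0.0]])
with_outlier = hausdorff(phi_1, [[0.0, 0.0], [5.0, 0.0]])
```

In this toy case, adding the single distant embedding `[5.0, 0.0]` raises the score from 1 to 4, although the remaining embeddings are unchanged.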
Difference between token embedding diversities (DIV). Similar to APD, this measure assesses the change
s by exploiting the variance of the contextualized representation
\(\Phi _1\) and
\(\Phi _2\). Unlike APD, the
difference between token embedding diversities (DIV) leverages a coefficient of variation calculated as the average of the cosine distances
d between the embeddings
\(\Phi _1\) and
\(\Phi _2\) and their prototypical embeddings
\(\mu _1\) and
\(\mu _2\), respectively [
72]. The intuition is that when
w is used in just one sense, its embeddings tend to be close to each other, yielding a low coefficient of variation. Conversely, when
w is used in many different senses, its embeddings are distant from each other, yielding a high coefficient of variation. DIV is defined as the absolute difference between the coefficients of variation in \(C_1\) and \(C_2\):
\(DIV(\Phi _1, \Phi _2) = \left| \frac{1}{|\Phi _1|} \sum _{e_1 \in \Phi _1} d(e_1, \mu _1) - \frac{1}{|\Phi _2|} \sum _{e_2 \in \Phi _2} d(e_2, \mu _2) \right|.\)
In Reference [
72], the experiments show that when the coefficient of variation is low, the prototypical embeddings
\(\mu _1\) and
\(\mu _2\) successfully represent the meanings of the given word
w. Conversely, when the coefficient of variation is high, the prototypical embeddings \(\mu _1\) and \(\mu _2\) do not provide a relevant representation of the meanings of w.
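The DIV measure can be sketched as follows, assuming that the coefficient of variation is the mean cosine distance between each embedding and the prototype of its set, as described above. The function names are assumptions of this example.

```python
import numpy as np


def diversity(phi):
    """Mean cosine distance between each embedding and the set prototype."""
    x = np.asarray(phi, dtype=float)
    mu = x.mean(axis=0)
    cos = (x @ mu) / (np.linalg.norm(x, axis=1) * np.linalg.norm(mu))
    return float((1.0 - cos).mean())


def div(phi_1, phi_2):
    """Absolute difference between the diversities of the two sets."""
    return abs(diversity(phi_1) - diversity(phi_2))


phi_1 = [[1.0, 0.0], [1.0, 0.0]]  # a single sense: identical directions
phi_2 = [[1.0, 0.0], [0.0, 1.0]]  # two senses: orthogonal directions
s = div(phi_1, phi_2)
```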
4.2 Sense-based Approaches
According to Table
4, we note that all the sense-based approaches are time-oblivious and that fine-tuning is sometimes adopted, but mainly for domain-adaptation purposes. Most papers leverage unsupervised learning modalities. Only a few exceptions employ a lexicographic supervision (i.e., References [
58,
113,
114]). Unlike form-based approaches, sense-based approaches usually enforce clustering in the aggregation stage. Aggregation by averaging is only exploited in References [
58,
100,
105], where sense prototypes are computed on top of the results of a clustering operation.
When clustering is adopted, the function
f that calculates the change
s can be directly defined over the embeddings
\(\Phi _1\) and
\(\Phi _2\). As an alternative, the function
f can be defined over the distribution of the embeddings in the resulting clusters (i.e.,
cluster distribution). In this case, as a result of the clustering operation, a counting function
c is used to determine two cluster distributions
\(p_1\) and
\(p_2\) that represent the normalized number of embeddings in the cluster partitions
\(\phi _{1,i}\) and
\(\phi _{2,i}\), respectively (see Section
2). The
ith value
\(p_{j,i}\) in
\(p_j\) (with
\(j \in \lbrace 1, 2\rbrace\)) represents the fraction of the embeddings of \(\Phi _j\) that fall in the
ith cluster, namely:
\(p_{j,i} = \frac{|\phi _{j,i}|}{|\Phi _j|} .\) Finally, the function
f is defined as a compound function
\(f = g \ \circ \ c\), where the result of the
c function is exploited by a change function
g that works on the cluster distributions
\(p_1\) and
\(p_2\).
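The counting function c can be sketched as follows, assuming that clustering has been performed jointly over both sets of embeddings and that each embedding is annotated with its corpus of origin. The function name and argument layout are assumptions of this example.

```python
import numpy as np


def cluster_distributions(labels, corpus_ids, k):
    """Return (p_1, p_2): normalized cluster counts for each corpus.

    labels:     cluster index in [0, k) of each embedding
    corpus_ids: 1 for embeddings from C_1, 2 for embeddings from C_2
    """
    labels = np.asarray(labels)
    corpus_ids = np.asarray(corpus_ids)
    distributions = []
    for j in (1, 2):
        counts = np.bincount(labels[corpus_ids == j], minlength=k)
        distributions.append(counts / counts.sum())
    return distributions[0], distributions[1]


labels = [0, 0, 1, 1, 1, 2]
corpus_ids = [1, 1, 1, 2, 2, 2]
p_1, p_2 = cluster_distributions(labels, corpus_ids, k=3)
```

Any change function g defined over distributions can then be applied to `p_1` and `p_2`.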
In sense-based approaches, the following change functions are proposed for measuring the semantic change s:
Maximum novelty score (MNS). This measure exploits the cluster distributions
\(p_1\) and
\(p_2\) by leveraging the idea that the higher the ratio between the number of embeddings from \(\Phi _1\) and from \(\Phi _2\) in a cluster, the higher the semantic change of the considered word w. The
maximum novelty score (MNS) is defined as:
\(MNS(p_1, p_2) = \max _{i \in \lbrace 1, \dots , k\rbrace } NS(p_{1,i}, p_{2,i}),\)
where
\(NS(p_{1,i}, p_{2,i}) = p_{1,i}/p_{2,i}\) is the
novelty score proposed in Reference [
28], and
k is the number of clusters produced as a result of the aggregation stage.
In Reference [
58], MNS is employed as a change measure in a supervised learning approach. In particular, a lexicographic supervision (i.e., the Oxford English dictionary) is employed to provide the meanings of the target word
w. Each word occurrence in
\(\Phi _1\) and
\(\Phi _2\) is associated with the closest meaning of the dictionary according to the cosine distance. As a result, for each word/dictionary meaning, a cluster of word embeddings is defined and MNS is exploited to calculate the overall change.
Maximum square (MS). This measure is an alternative to MNS to assess the change
s. The intuition of MS is that slight changes in cluster distributions
\(p_1\) and
\(p_2\) may occur due to noise and do not represent a real semantic change [
115]. The
maximum square (MS) aims at identifying strong changes in the cluster distributions. Unlike MNS, the squared difference between \(p_{1,i}\) and \(p_{2,i}\) is used to capture the degree of change instead of the
novelty score (NS):
\(MS(p_1, p_2) = \max _{i \in \lbrace 1, \dots , k\rbrace } (p_{1,i} - p_{2,i})^2.\)
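Both MNS and MS reduce to a maximum over per-cluster scores, as the following sketch shows. The small constant `eps`, which avoids a division by zero when a cluster is empty in \(C_2\), is an assumption of this example and is not taken from the cited works.

```python
import numpy as np


def max_novelty_score(p_1, p_2, eps=1e-12):
    """Maximum over clusters of the ratio p_{1,i} / p_{2,i}."""
    p_1 = np.asarray(p_1, dtype=float)
    p_2 = np.asarray(p_2, dtype=float)
    return float(np.max(p_1 / (p_2 + eps)))  # eps is an assumed smoothing term


def max_square(p_1, p_2):
    """Maximum over clusters of the squared difference (p_{1,i} - p_{2,i})^2."""
    p_1 = np.asarray(p_1, dtype=float)
    p_2 = np.asarray(p_2, dtype=float)
    return float(np.max((p_1 - p_2) ** 2))


p_1 = [0.5, 0.5]
p_2 = [0.25, 0.75]
mns = max_novelty_score(p_1, p_2)
ms = max_square(p_1, p_2)
```

Squaring shrinks small differences more than large ones, which matches the intuition that slight fluctuations in the cluster distributions should weigh little.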
Jensen-Shannon divergence (JSD). This measure extends the
Kullback-Leibler (KL) divergence, which calculates how one probability distribution is different from another. The
Jensen-Shannon divergence (JSD) calculates the change
s as the symmetrized KL score of the cluster distributions \(p_1\) and \(p_2\), namely:
\(JSD(p_1, p_2) = \frac{1}{2} \left( KL(p_1 \, \Vert \, M) + KL(p_2 \, \Vert \, M) \right),\)
where KL is the Kullback-Leibler divergence and
\(M=(p_1+p_2)/2\).
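A minimal sketch of JSD over two cluster distributions is reported below. The base-2 logarithm, which bounds the score in [0, 1], is an assumption of this example; other bases only rescale the result.

```python
import numpy as np


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))


def jsd(p_1, p_2):
    """Jensen-Shannon divergence between two cluster distributions."""
    p_1 = np.asarray(p_1, dtype=float)
    p_2 = np.asarray(p_2, dtype=float)
    m = (p_1 + p_2) / 2.0
    return 0.5 * kl(p_1, m) + 0.5 * kl(p_2, m)


disjoint = jsd([1.0, 0.0], [0.0, 1.0])  # senses do not overlap at all
identical = jsd([0.3, 0.7], [0.3, 0.7])
```

Since M is positive wherever \(p_1\) or \(p_2\) is positive, the KL terms are always finite, which is not guaranteed when KL is applied to \(p_1\) and \(p_2\) directly.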
JSD is also used in approaches where aggregation by clustering is performed separately over the embeddings
\(\Phi _1\) and
\(\Phi _2\) [
64]. As a result, the clusters need to be aligned to determine the distributions
\(p_1\) and
\(p_2\) before the JSD calculation. Unlike Reference [
64], an evolutionary clustering algorithm is employed in Reference [
105] to apply the JSD measure without requiring any alignment step over the resulting clusters.
As a final remark, JSD can be employed to measure the change
s over more than two time periods. However, the experiments in Reference [
45] show that the JSD effectiveness over a single time period outperforms the version over more time periods, since JSD is insensitive to the order of the temporal intervals.
Coefficient of semantic change (CSC). This measure is proposed as an alternative to JSD, where the difference over the weighted number of elements in
\(\phi _{1,i}\) and
\(\phi _{2,i}\) for each cluster
i is employed to replace KL in measuring the change [
64]. The
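coefficient of semantic change (CSC) is defined as follows:
\(CSC(p_1, p_2) = \frac{1}{P_1 \cdot P_2} \sum _{i=1}^{k} |P_2 \cdot p_{1,i} - P_1 \cdot p_{2,i}|,\)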
where
\(P_j = \sum ^k_{i=1} p_{j,i}\) is the weight of each cluster distribution and
k is the number of clusters.
Cosine distance between cluster distributions (CDCD). As a further alternative to JSD, this measure assesses the change s by considering the cluster distributions \(p_1\) and \(p_2\) as vectors and by applying the cosine distance over them. The
cosine distance between cluster distributions (CDCD) is defined as follows:
\(CDCD(p_1, p_2) = 1 - CS(p_1, p_2) = 1 - \frac{p_1 \cdot p_2}{\Vert p_1\Vert \, \Vert p_2\Vert }.\)
In Reference [
7], CDCD is calculated between the cluster distributions
\(p_1\) and
\(p_2\) obtained by enforcing clustering over bag-of-substitutes (see the description of Reference [
7] in Section
4.1).
Entropy difference (ED). This measure is based on the idea that the higher the uncertainty in the interpretation of a word occurrence due to the polysemy of w in \(C_1\) and \(C_2\), the higher the semantic change s. The intuition is that high values of ED are associated with the broadening of a word’s interpretation, while negative values indicate a narrowing of it [
45]. The
entropy difference (ED) is defined as follows:
\(ED(p_1, p_2) = \eta (p_2) - \eta (p_1),\)
where
\(\eta (p_j)\) is the degree of polysemy of
w in the corpus
\(C_j\), which is calculated as the normalized entropy of its cluster distribution
\(p_j\):
\(\eta (p_j) = - \frac{1}{\log k} \sum _{i=1}^{k} p_{j,i} \log p_{j,i}.\)
As shown in Reference [
45], ED is not capable of properly assessing
s when new usage types of
w emerge, while old ones become obsolescent at the same time, since it may lead to no entropy reduction.
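The limitation noted above can be reproduced with a small sketch. It assumes that ED is computed as the normalized entropy of \(p_2\) minus that of \(p_1\), so that positive values correspond to broadening; the function names are assumptions of this example.

```python
import numpy as np


def normalized_entropy(p):
    """Shannon entropy of a cluster distribution divided by log k (requires k > 1)."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(p.size))


def entropy_difference(p_1, p_2):
    """Assumed sign convention: positive for broadening, negative for narrowing."""
    return normalized_entropy(p_2) - normalized_entropy(p_1)


broadening = entropy_difference([1.0, 0.0], [0.5, 0.5])    # a second sense appears
narrowing = entropy_difference([0.5, 0.5], [1.0, 0.0])     # one sense disappears
replacement = entropy_difference([1.0, 0.0], [0.0, 1.0])   # old sense replaced by a new one
```

In the last case, the word changes its meaning completely, yet the entropy is the same in both periods and the score is zero.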
Cosine distance between semantic prototypes (PDIS). This measure is presented in Reference [
105] as an extension of the CD measure adopted by form-based approaches. The idea of PDIS is that the aggregation by averaging over cluster prototypes can be employed to produce summary descriptions of the cluster contents (i.e.,
semantic prototypes). The
cosine distance between semantic prototypes (PDIS) is defined as the CD between
\(\bar{c}_1\),
\(\bar{c}_2\), that is:
\(PDIS(\bar{c}_1, \bar{c}_2) = 1 - CS(\bar{c}_1, \bar{c}_2),\)
where
\(\bar{c}_1\) and
\(\bar{c}_2\) are semantic prototypes defined as the average embeddings of all the sense prototypes
\(c_{1,i}\) and
\(c_{2,i}\), respectively.
Difference between prototype embedding diversities (PDIV). This measure is presented in Reference [
105] as an extension of the DIV measure adopted by form-based approaches. PDIV leverages the same intuition as PDIS, namely, the semantic prototypes can be employed to calculate the coefficient of ambiguity of
w by measuring the difference between a semantic prototype
\(\bar{c}_j\) and each sense prototype
\(c_{j,i}\). The
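difference between prototype embedding diversities (PDIV) is defined as the absolute difference between these ambiguity coefficients:
\(PDIV(\Psi _1, \Psi _2) = \left| \frac{1}{|\Psi _1|} \sum _{c_{1,i} \in \Psi _1} d(c_{1,i}, \bar{c}_1) - \frac{1}{|\Psi _2|} \sum _{c_{2,i} \in \Psi _2} d(c_{2,i}, \bar{c}_2) \right|,\)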
where
\(\Psi _1\) and
\(\Psi _2\) denote the sets of sense prototypes
\(c_{1,i}\) and
\(c_{2,i}\), respectively.
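PDIS and PDIV can be sketched together, since both rely on the semantic prototype obtained by averaging the sense prototypes. The sketch assumes that the sense prototypes of each period are given as rows of an array and that the ambiguity coefficient is the mean cosine distance between each sense prototype and the semantic prototype; all names are assumptions of this example.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance, i.e., one minus the cosine similarity."""
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def semantic_prototype(psi):
    """Average of all the sense prototypes of a period."""
    return np.asarray(psi, dtype=float).mean(axis=0)


def ambiguity(psi):
    """Mean cosine distance between each sense prototype and the semantic prototype."""
    c_bar = semantic_prototype(psi)
    return float(np.mean([cosine_distance(np.asarray(c, dtype=float), c_bar) for c in psi]))


def pdis(psi_1, psi_2):
    return cosine_distance(semantic_prototype(psi_1), semantic_prototype(psi_2))


def pdiv(psi_1, psi_2):
    return abs(ambiguity(psi_1) - ambiguity(psi_2))


psi_1 = [[1.0, 0.0], [0.0, 1.0]]  # two distinct senses in the first period
psi_2 = [[1.0, 0.0], [1.0, 0.0]]  # senses collapse onto one direction
s_pdis, s_pdiv = pdis(psi_1, psi_2), pdiv(psi_1, psi_2)
```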
Average pairwise distance (APD). In addition to form-based approaches (see Section
4.1), the APD measure is exploited to assess
s also in sense-based approaches. In References [
113,
114], APD is applied to the contextualized embeddings
\(\Phi _1\) and
\(\Phi _2\) extracted from a fine-tuned XLM-R model. In particular, an English corpus is used to fine-tune the pre-trained LLM to select the most appropriate WordNet’s definition for each word occurrence [
14]. As a result of the fine-tuning, both WordNet’s definitions and word occurrences are embedded in the same vector space, and the meaning of any word occurrence can be induced by selecting the closest definition in the vector space. In Reference [
113], the zero-shot, cross-lingual transferability property of XLM-R is exploited to obtain word representations for Russian language and APD is finally applied [
23,
27]. Reference [
113] claims that the approach is useful to overcome the lack of lexicographic supervision for low-resource languages and that most concept definitions in English also hold in other languages, such as Russian. However, this claim does not fully hold, since some words can drastically change their meaning across languages. For example, the Russian word “пионер” (i.e., pioneer, scout) is strongly connected to the Communist ideology of the Soviet period, whereas its English counterpart is not.
Average pairwise distance between sense prototypes (APDP). This measure is an extension of APD, and it considers all the pairs of sense prototypes
\(c_{1,i}\) and
\(c_{2,i}\) instead of all the original embeddings in
\(\Phi _1\) and
\(\Phi _2\) [
66]. The
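average pairwise distance between sense prototypes (APDP) is defined as:
\(APDP(\Psi _1, \Psi _2) = \frac{1}{|\Psi _1| \cdot |\Psi _2|} \sum _{c_{1,i} \in \Psi _1,\ c_{2,j} \in \Psi _2} d(c_{1,i}, c_{2,j}).\)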
Wasserstein distance (WD). This measure models the change assessment as an
optimal transport problem, and it is exploited as an alternative to cluster alignment when aggregation by clustering is performed separately over the embeddings
\(\Phi _1\) and
\(\Phi _2\) [
100]. WD quantifies the effort of re-configuring the cluster distribution of
\(p_1\) into
\(p_2\), namely, minimizing the cost of moving one unit of mass (i.e., a sense prototype) from
\(\Psi _1\) to
\(\Psi _2\). The
Wasserstein distance (WD) is defined as:
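\(WD(p_1, p_2) = \min _{\gamma } \sum _{i=1}^{k_1} \sum _{j=1}^{k_2} \gamma _{c_{1,i} \rightarrow c_{2,j}} \cdot CD(c_{1,i}, c_{2,j}),\)
where each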
\(\gamma _{c_{1,i} \rightarrow c_{2,j}}\) represents the (unknown) effort required to reconfigure the mass distribution
\(p_1\) into
\(p_2\);
\(k_1\) and
\(k_2\) are the number of clusters obtained by clustering
\(\Phi _1\) and
\(\Phi _2\), respectively;
CD is the cosine distance computed over the sense prototypes
\(c_{1,i} \in \Psi _1\) and
\(c_{2,j} \in \Psi _2\) [
17].
4.3 Ensemble-based Approaches
In this section, we review the approaches that rely on an ensemble mechanism, namely, the combination of two or more assessment functions to determine the semantic change score. Ensembling can mean that more than one form- and/or sense-based measure is adopted in a given approach. Ensembling can also mean that both static and large LMs are used in a disciplined way. A final semantic change score is then returned by the whole ensemble process.
According to Table
5, we note that all the ensemble approaches are time-oblivious with the exception of References [
110] and [
117]. We also note that unsupervised learning modalities are adopted with the exception of Reference [
113]. As a further remark, most of the ensemble solutions exploit LLMs trained over different languages.
Some ensemble approaches combine form-based and sense-based measures to improve the quality of results. On the one hand, form-based measures are exploited to better capture the dominant sense of the target word
w. On the other hand, sense-based measures are exploited to represent all the meanings of
w, including the minor ones. The combination of CD (see form-based approaches in Section
4.1) and JSD (see sense-based approaches in Section
4.2) is proposed in Reference [
93]. As a further ensemble experiment, the results of combining APD, HD, and JSD are discussed in Reference [
138]. The APD measure is also considered in Reference [
113], where multiple change scores are calculated by using different distance metrics (e.g., Manhattan distance, CD, Euclidean distance), and these scores are exploited to train a regression model as an ensemble.
Ensemble approaches based on two form-based measures are also proposed. For instance, in Reference [
46], the final semantic change
s is obtained by averaging APD and PRT scores. This is motivated by experimental results where sometimes APD outperforms PRT, while some other times PRT outperforms APD [
73].
Some other ensemble approaches are based on the idea of combining static and contextualized embeddings. The intuition is that static embeddings can capture the dominant sense of the target word
w better than form-based, contextualized embeddings. In References [
110,
134], the semantic change
s is assessed by leveraging both static and contextualized embeddings. In particular,
s is determined by the linear combination of the scores obtained by two approaches: (i) the APD measure over contextualized embeddings (see form-based approaches in Section
4.1); (ii) the CD measure over static embeddings aligned according to the approach described in Reference [
53]. Similarly, in Reference [
93], instead of directly using the APD measure, JSD is exploited over clusters of contextualized embeddings (see sense-based approaches in Section
4.2). As a further difference, the scores obtained by static and contextualized approaches are combined by multiplication. The intuition is that, since the score distributions of the two approaches are unknown, multiplication prevents an approach from contributing more than the other one in the final score.
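The difference between the two combination strategies can be illustrated with a toy sketch. The scores, the equal weights, and the function names are assumptions of this example. It shows that rescaling one score distribution can flip the ranking produced by a linear combination, whereas the ranking produced by multiplication is unaffected.

```python
def linear_combination(s_contextual, s_static, weight=0.5):
    """Weighted sum of the two change scores (the weight is an assumption)."""
    return weight * s_contextual + (1.0 - weight) * s_static


def product_combination(s_contextual, s_static):
    """Multiplicative combination of the two change scores."""
    return s_contextual * s_static


def rank(scores, combine):
    """Words sorted by decreasing combined change score."""
    return sorted(scores, key=lambda w: combine(*scores[w]), reverse=True)


# Hypothetical (contextualized, static) scores for two words.
scores = {"A": (0.9, 0.1), "B": (0.3, 0.4)}
# Same scores, with the static ones expressed on a scale ten times larger.
rescaled = {w: (c, 10 * s) for w, (c, s) in scores.items()}

linear_before = rank(scores, linear_combination)
linear_after = rank(rescaled, linear_combination)
product_before = rank(scores, product_combination)
product_after = rank(rescaled, product_combination)
```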
Approaches can also be combined with grammatical profiles under the intuition that grammatical changes are slow and gradual, while lexical contexts can change very quickly [
46,
77]. Grammatical profile vectors
\(gp_1\) and
\(gp_2\) are associated with the times
\(t_1\) and
\(t_2\), respectively, to represent morphological and syntactical features of the considered language in the time period. In Reference [
122], the contextualized embeddings of the occurrences of the word w are combined with the grammatical vectors. A linear regression model with regularization is trained by using as features the cosine similarities over
\(\Phi _1\) and
\(\Phi _2\) and over the grammatical vectors
\(gp_1\) and
\(gp_2\).
As a further ensemble approach, the combination of different time-aware techniques such as temporal attention and time masking was tested by Reference [
117] to better incorporate time into word embeddings.
4.4 Discussion
According to Sections
4.1–
4.3, we note that form-based approaches are more popular than sense-based ones. Most papers are characterized by time-oblivious approaches, and only a few time-aware approaches have recently appeared (e.g., Reference [
117]). All approaches leverage unsupervised learning modalities with few exceptions (e.g., Reference [
58]). We argue that this is due to the recent introduction of a reference evaluation framework for semantic change assessment proposed at SemEval-2020 Shared Task 1, where participants were asked to adopt an unsupervised configuration [
125].
All papers rely on contextualized word embeddings extracted from BERT-like models. Regardless of their version (i.e., tiny, small, base, large), BERT and XLM-R are the most frequently used LLMs, and only a few experiments rely on ELMo and RoBERTa. As a matter of fact, the size of data needed to train or fine-tune an XLM-R model is several orders of magnitude greater than that needed for BERT. Moreover, even if less frequently employed than BERT, ELMo seems to be promising for LSC and outperforms BERT, while being much faster in training and inference [
73]. As a further interesting remark, the use of static
document embeddings extracted from a Doc2Vec [
84] model has been proposed to provide pseudo-contextualized
word embeddings as an alternative to BERT [
105].
Monolingual and multilingual LLMs are both popular. The BERT models are the most frequently used monolingual models. XLM-R models are generally preferred to
mBERT (multilingual BERT) models, since the former are trained on a larger amount of data and languages, thus the intuition is that they can better encode the language usages. Multilingual models are used both in multilingual settings, where corpora of different languages are considered (e.g., Reference [
91]), and monolingual settings, where just corpora of one language are given (e.g., in Reference [
46]). In a monolingual setting, the use of a multilingual model is motivated by two reasons: (i) a model pre-trained on a specific language is not available (e.g., Reference [
73]), (ii) multilingual models are employed to exploit their cross-lingual transferability property (e.g., Reference [
113]).
Considering the type of training, most of the papers directly use pre-trained LLMs or fine-tune them for domain adaptation. Only a few papers propose to exploit a specific fine-tuning (e.g., Reference [
110]) or to incrementally fine-tune a pre-trained LLM (e.g., Reference [
73]). Experiments indicate that fine-tuning a pre-trained LLM for domain adaptation consistently boosts the quality of results when compared against pre-trained LLMs (e.g., Reference [
112]). The impact of fine-tuning on performance is analyzed in Reference [
92], where it is shown that optimal results are achieved by fine-tuning a pre-trained LLM for five epochs and that, after five epochs, performance decreases due to overfitting. However, we argue that the fine-tuning effectiveness strictly depends on the size and domain of the considered corpora. In many papers, a different number of epochs is proposed with varying results (e.g., Reference [
73]).
When an LLM is used, contextualized word embeddings are typically extracted from the last one or the last four layers of the model. Experiments show that the semantic features of text are mainly encoded in the last four encoder layers of BERT [
33,
62]. In some papers, contextualized embeddings are extracted by aggregating the output of the first and the last encoder layers. In this case, the idea is to combine
surface features (i.e., phrase-level information, [
62]) encoded in the first layer with the semantic features from the last one. Only in Reference [
81] is the standalone use of lower layers of BERT proposed. Middle layers of BERT are usually excluded, since they mainly encode syntactic features [
62]. When contextualized embeddings are extracted from more than one layer, they are generally aggregated by average or sum (e.g., Reference [
105]). As an alternative, the use of concatenation is proposed in Reference [
64].
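The three aggregation strategies can be sketched over the per-layer vectors of a single token. The sketch assumes that the hidden states are available as an array with one row per layer; how such an array is obtained, and which row corresponds to the first encoder layer, depends on the library in use and is outside the scope of this example.

```python
import numpy as np


def aggregate_layers(hidden_states, layers=(-4, -3, -2, -1), how="mean"):
    """Combine the vectors of the selected layers for one token.

    hidden_states: array of shape (n_layers, dim), one row per layer
    layers:        indices of the layers to combine (default: last four)
    how:           "mean", "sum", or "concat"
    """
    selected = np.asarray(hidden_states, dtype=float)[list(layers)]
    if how == "mean":
        return selected.mean(axis=0)
    if how == "sum":
        return selected.sum(axis=0)
    if how == "concat":
        return selected.reshape(-1)
    raise ValueError(f"unsupported aggregation: {how}")


# Toy hidden states: 13 layers of dimension 3 (illustrative values only).
hidden_states = np.arange(39).reshape(13, 3)
averaged = aggregate_layers(hidden_states, how="mean")
summed = aggregate_layers(hidden_states, how="sum")
concatenated = aggregate_layers(hidden_states, how="concat")
```

Averaging and summing preserve the embedding dimension, while concatenation multiplies it by the number of selected layers.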
As a further note, when an LLM is used, some words may be split into word pieces by a subword-based tokenization algorithm [
129,
140]. In this case, word piece representations are generally synthesized into a single word representation
\(e_{j,k}\) through averaging (e.g., Reference [
91]) or concatenating (e.g., Reference [
93]). As an alternative that avoids this problem, the pre-trained vocabulary associated with the LLM can be extended by adding some words of interest. Then, a fine-tuning step is performed to learn the weights associated with the added words (e.g., Reference [
116]).
Clustering operations are typically exploited in sense-based approaches to perform Word Sense Induction [
1,
4,
83,
90]. The only form-based approach that relies on clustering is presented in Reference [
12] (see Section
4.1 for details). The clustering algorithms most frequently employed are K-means and
Affinity Propagation (AP). Further considered clustering algorithms are
Gaussian Mixture Models (GMMs) (e.g., Reference [
118]),
agglomerative clustering (AGG) (e.g., Reference [
7]), DBSCAN (e.g., Reference [
65]), HDBSCAN (e.g., Reference [
118]),
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) (e.g., Reference [
118]),
A Posteriori affinity Propagation (APP) (e.g., Reference [
105]), and
Incremental Affinity Propagation based on Nearest neighbor Assignment (IAPNA) (e.g., Reference [
105]). Since K-means, GMMs, and AGG require the number of clusters to be defined in advance, a silhouette score is generally employed to determine the optimal number of clusters [
120]. As an alternative, the AP algorithm is employed to let the number of clusters emerge without fixing it in advance. DBSCAN is proposed due to its capability of reducing noise by specifying (i) the minimum number of embeddings of each cluster and (ii) the maximum distance
\(\epsilon\) between two embeddings in a cluster. HDBSCAN is the hierarchical version of DBSCAN, and it can manage clusters of different sizes. Unlike DBSCAN, HDBSCAN can detect noise without the
\(\epsilon\) parameter. APP and IAPNA are incremental extensions of AP, and their use is proposed for LSC when more than one time interval is considered. In Reference [
118], different clustering algorithms are compared and the experiments show that (i) DBSCAN is very sensitive to scale, since
\(\epsilon\) is predefined, and (ii) BIRCH tends to find a lot of small clusters that are marginal with respect to word meanings.
Considering the change functions, a detailed presentation of possible alternatives has been provided in Sections
4.1 and
4.2. As a final remark, we note that CD and APD are frequently exploited in form-based approaches, while JSD is commonly employed in sense-based approaches.
Finally, as for the language of considered corpora, most papers consider the shared benchmark datasets taken from competitive evaluation campaigns (e.g., LSCDiscovery, [
143]). Commonly considered languages are English, German, Latin, and Swedish, which appeared in 2020 at SemEval Task 1 [
125]. Russian appeared in 2021 at RuShiftEval [
75,
76]. Spanish appeared in 2022 at LSCDiscovery [
143]. The Italian language was introduced in 2020 at DIACR-Ita [
11]. The approach described in Reference [
91] represents a novel attempt to consider a diachronic corpus containing texts of different languages, namely, English and Slovenian.