1 Introduction
The Covid-19 pandemic has drastically affected people around the world, both in how individuals face their day-to-day lives and at the level of entire communities. For instance, the impact of the restrictions adopted to mitigate the spread of the pandemic has recently been assessed in terms of urban crime [49], socioeconomic conditions [14], travel behavior [8], and waste management [55].
Different areas within the same city, some more than others, have experienced drastic shifts in how they are lived, mainly due to rules concerning social distancing and prolonged lockdowns. For example, areas around train stations may have experienced an increase in criminality, since fewer travelers pass along their streets. High-density and high-traffic areas may instead have experienced a decrease in road accidents and traffic jams because of the lower number of cars around. These shifts in urban life are worth monitoring and profiling for several reasons. First, evaluating the short-term impact of restrictions may provide a clearer picture of how the pandemic is actually affecting our lives. Second, appropriate tools to model changes in urban life may ease the comprehension of long-term repercussions. Third, the analysis of shifts may help local governments and security officers, who are in charge of managing public order and implementing forward-looking policies, to keep the city safe and livable.
Clearly, the problem becomes particularly interesting in the context of smart cities. The idea of a smart city hinges on the ability to exploit technological advancements, in terms of algorithms and available data, to improve the quality of life and the services available to residents. The last few years have seen an ever-growing interest in the development of systems and frameworks to monitor and analyze various aspects of life within cities. Technology is more than ever poised to play an important role in shaping the cities of the future. It can help local governments take informed decisions for the infrastructural development of the city, handle problems related, for example, to crime and weather threats, and, more generally, improve the quality of life of citizens. Research on smart cities has been steadily growing in the last few years, also because it encompasses several different domains of analysis and application, such as the environmental, social, and mobility ones [23, 30, 62]. In this context, profiling city areas can be considered one of the main fields of application [31, 57].
Several approaches have been proposed in the literature for the profiling of city areas [21]. One option relies on the use of structured data [22], setting up a framework to explore Web sources such as points of interest (e.g., restaurants, museums), traffic information, and house pricing. Concerning the use of more unstructured data, such as local online newspaper articles, it has been proposed [15, 19] to exploit articles and tags to (i) identify a macro-categorization of news articles based on the semantic similarity between tags and (ii) classify news articles as belonging to one of such categories. The resulting information is used, on the one hand, to cluster city areas based on the identified categories and, on the other hand, to describe the various city areas in terms of the news reported for them. However, such approaches are not able to grasp the dynamic aspects of these phenomena, either for generic topics like crimes and traffic, which are ordinarily covered by media, or for new and emerging topics.
In this article, we aim to take the evolution of topics into account as well. Specifically, we propose a system for the automatic evaluation of how the profiles of different city areas change over time. To capture the main characteristics of this temporal evolution, we split continuous time into time windows and apply a purposely adapted streaming clustering algorithm to identify clusters of news articles in each time window: this lets us uncover the differences between the clustering outcomes obtained in adjacent windows. To track the evolution of clusters along subsequent windows, we introduce a set of metrics that describe the relation between a cluster in one window and the clusters in the previous window, based on the amount of shared members.
The evaluation of systems for profiling city areas is a rather complex task. The scope of the available research works is particularly broad, spanning different aspects and problems in the context of smart cities (see Section 2). Thus, to the best of our knowledge, no benchmark dataset is currently available to support direct evaluation and comparison of models and analytical approaches. To overcome this limitation, we chose to quantitatively investigate the effectiveness of our framework via an experimental analysis on a paradigmatic case study, carried out on news data for the city of Rome during 2020. The rationale behind this choice is twofold. First, it lets us obtain a vast amount of news articles for a specific city, mostly geo-localized ones [15]; second, it represents an interesting evaluation ground for the proposed methodology. In fact, news related to the Coronavirus, obviously absent in previous years and at the very beginning of 2020, emerged in the subsequent months and featured prominently, inevitably intertwining with other aspects of city life. This second aspect is particularly interesting, as it may allow us to build more robust and easier-to-model systems that can take into account unexpected variations in the data and enable a more in-depth understanding of how the pandemic affected life in the city. Concerning the choice of the specific city, it is worth underlining that our pipeline is highly versatile: nowadays, local online newspapers exist for virtually any city, which makes an analysis with fine spatial granularity plausible for any selected target area. We chose Rome because the area is familiar to most of the authors, making it easier to spot possible problems or anomalies during system development.
The end goal of our work is to enable a descriptive and fully unsupervised analysis of changes over time in city area profiles. In our view, two different useful aspects of the proposed framework are worth underlining. On the one hand, the extracted knowledge can be exploited to automatically describe city areas in terms of what is reported to be happening in them, both in near-real time and across time. Moreover, the use of news articles and an unsupervised pipeline allows us to (i) avoid focusing on specific aspects of interest (e.g., crime rates, housing prices) and (ii) discover novel descriptors for city areas as they appear over time, as in the case of the Coronavirus pandemic. On the other hand, the obtained results could proficiently support applications of different types. As a first example, we may consider a web mapping service, such as the popular Google Maps: whenever users request route planning, or explore a specific area, they can get aggregate information about what has recently characterized that area; furthermore, this information is always up to date, and historical data series may also be made available. Thanks to the information provided by our framework, users could decide to avoid a dangerous area characterized by numerous crimes and ask for alternative routes. As a second example, we can consider a real estate online platform or a physical agency: clearly, an updated characterization of city areas could represent an important strategic asset in this domain.
The main contributions of this article hinge on the design, development and deployment of a news-based framework, featuring the functionalities of news collection, data representation and processing, clustering, and knowledge extraction from clustering results. In particular, the novel data analysis pipeline entails the following contributions:
— The online news clustering task is addressed with a recently proposed density-based streaming clustering algorithm, adequately modified to automatically tune some of its parameters and adapt to evolving scenarios. Furthermore, an appropriate dimensionality reduction technique is employed to manage the high-dimensional real-valued vectors, i.e., embeddings, generated by language models to represent texts;
— A novel method for cluster labeling is proposed, aimed at revealing the topics covered by the news aggregated in the clusters, by leveraging a set of tags obtained from the news articles;
— The identification of relevant patterns in the temporal evolution of clusters is addressed by introducing a novel set of metrics, defined to identify relationships among clusters in adjacent time windows;
— An in-depth experimental investigation is carried out with the proposed framework, leveraging a case study regarding the city of Rome during the Covid-19 pandemic, in order to evaluate the impact of the pandemic on the city in terms of clusters of reported news.
The rest of this article is organized as follows. In Section 2, the literature on city profiling and Natural Language Processing (NLP) is reviewed. Section 3 provides some background on specific techniques applied in the present work. Section 4 overviews the proposed system for city news clustering. Section 5 thoroughly describes the data preprocessing and clustering stage, whereas Section 6 describes the novel approaches for cluster labeling and tracking, which enable knowledge extraction from the clustering results. In Section 7, we evaluate our approach on a case study featuring news about the city of Rome during the Coronavirus pandemic, and we thoroughly discuss the obtained results. Finally, in Section 8, we draw conclusions and outline future directions.
5 Data Preprocessing and Clustering
TSF-DBSCAN addresses most of the requirements previously identified, but it still lacks some crucial abilities. The scenario addressed in this article is, in fact, particularly hostile and precludes the plain application of an ordinary clustering algorithm for two main reasons: the high dimensionality of the data, and the possible modification over time of the density of clusters. TSF-DBSCAN.news has been developed to cope with these additional challenges.
Whenever objects are represented in a high-dimensional attribute space (as is the case with word- and sequence-level embeddings), clustering algorithms struggle to produce significant results because of the so-called curse of dimensionality problem [39].
TSF-DBSCAN.news relies on the concept of density, which becomes less informative as the dataset dimensionality increases. In fact, the typical sparsity of data in high-dimensional attribute spaces makes all points appear similarly distant from one another: the distance of an object to its nearest object approaches the distance to the farthest one [12]. Consequently, the concept of local neighborhood, which determines what objects are core ones, turns out to be inappropriate to drive the cluster analysis. An adequate solution to this problem can rely on some form of dimensionality reduction as a preprocessing step, so that the downstream clustering procedure operates in a lower-dimensional space. It has been shown [41] that a manifold learning technique named t-distributed Stochastic Neighbor Embedding (t-SNE) [59] can be used for this purpose since, in general, the number of disjoint clusters in the target space coincides with the number of disjoint clusters in the original space. In our work, we consequently revisit the TSF-DBSCAN algorithm: during the online stage, the embedding vector of each collected article is computed; when the offline stage is triggered (i.e., a periodic condition is met), we use UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) [44], a more recent technique that improves over t-SNE, to project the article-level embeddings into a lower-dimensional space. The main advantage of UMAP over t-SNE is that it better preserves the global structure of the data and scales to larger datasets [36].
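A minimal sketch of this dimensionality reduction step is shown below, assuming the umap-learn package; the parameter values (30 neighbors, 5 output dimensions, cosine metric) anticipate the configuration reported in Section 7.2, and the function name is our own.

```python
import numpy as np
import umap  # umap-learn package


def reduce_embeddings(article_embeddings: np.ndarray,
                      n_neighbors: int = 30,
                      n_components: int = 5):
    """Project high-dimensional article embeddings (e.g., 768-D Sentence-BERT
    vectors) into a low-dimensional space before the offline reclustering step.
    Returns both the reduced vectors and the fitted reducer, so that objects of
    the previous window can later be projected into the same space (Section 6.2)."""
    reducer = umap.UMAP(n_neighbors=n_neighbors,
                        n_components=n_components,
                        metric="cosine",
                        random_state=42)
    reduced = reducer.fit_transform(article_embeddings)
    return reduced, reducer
```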
As for dealing with evolving scenarios, the original TSF-DBSCAN is inherently capable of handling non-stationary distributions, such as those characterized by the emergence, disappearance, or gradual movement of clusters. However, it is not flexible enough to adapt to changes over time in the density of clusters, since it exploits an initial static configuration of the parameters \(\varepsilon _\mathit {min}\) and \(\varepsilon _\mathit {max}\). Such a variation in cluster density can derive from concept drift [29, 64], i.e., the non-stationarity of the data generation process, and is likely to affect a stream of news, e.g., as a result of a burst of news on emerging “hot” topics. Furthermore, it is widely acknowledged that the choice of the \(\varepsilon\) parameter is a very critical issue also in the original DBSCAN [26], as it defines the neighborhood extent for any object and therefore implicitly affects the number and shape of the detected clusters: thus, the ability to update this parameter over time can provide a certain level of flexibility to the framework. To accommodate the evolving nature of news streams, in the proposed TSF-DBSCAN.news we included a stage for the automatic tuning of the threshold parameters prior to each reclustering step, based on a recently published approach [10]: the statistical modeling of the density distribution of objects lets us shape a heuristic to estimate \(\varepsilon _\mathit {min}\) and \(\varepsilon _\mathit {max}\). The underlying idea recalls the original proposal of the authors of DBSCAN [26] to resort to the \(\mathit {k}\)-\(\mathit {dist}\) function, which associates each object in the dataset with the distance from its \(\mathit {k}\)th nearest neighbor. However, instead of requiring the user to manually select the distance threshold upon visual analysis of the resulting \(\mathit {k}\)-\(\mathit {dist}\) plot, the heuristic automatically derives the parameter values for the fuzzy membership function on the basis of a Gaussian Mixture modeling of the \(\mathit {k}\)-\(\mathit {dist}\) array. Notably, the effectiveness of the approach has been previously shown on a number of synthetic benchmark datasets [10], also in the presence of artificially added noise.
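The sketch below illustrates the general idea under stated assumptions: the \(\mathit{k}\)-\(\mathit{dist}\) array is modeled with a Gaussian Mixture (here via scikit-learn), and the thresholds are derived from the component describing the dense region. The specific mapping used in the cited heuristic [10] is not reproduced here; the one in the code (mean and mean plus two standard deviations of the lowest-mean component) is only an illustrative assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors


def estimate_eps_thresholds(points: np.ndarray, k: int = 5, n_components: int = 2):
    """Adaptive tuning sketch: fit a Gaussian Mixture on the k-dist array and
    derive (eps_min, eps_max) from the component modeling the dense region.
    The mapping below is a hypothetical stand-in for the heuristic of [10]."""
    # k-dist: distance of each object from its k-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    distances, _ = nn.kneighbors(points)
    k_dist = distances[:, k]

    gmm = GaussianMixture(n_components=n_components, random_state=0)
    gmm.fit(k_dist.reshape(-1, 1))

    dense = int(np.argmin(gmm.means_.ravel()))       # component of the dense region
    mu = gmm.means_.ravel()[dense]
    sigma = float(np.sqrt(gmm.covariances_.ravel()[dense]))
    return mu, mu + 2.0 * sigma                       # assumed eps_min, eps_max
```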
Figure 2 shows the adapted version of the clustering algorithm exploited in this article, highlighting the steps of dimensionality reduction and adaptive tuning of the threshold parameters.
The outcome of this module for the generic time window \(t\) is the set \(P^t\) of the identified disjoint clusters, informally named “partition”, along with the membership of each object to a cluster in \(P^t\) (or its recognition as an outlier). Notably, a defuzzification process may be applied on the output of TSF-DBSCAN so that each object is assigned to at most one cluster based on the highest membership degree.
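A minimal sketch of this defuzzification step, assuming the membership degrees are available as an array with one row per object and one column per cluster (the function and its inputs are our own illustration):

```python
import numpy as np


def defuzzify(membership: np.ndarray, outlier_label: int = -1) -> np.ndarray:
    """Assign each object to at most one cluster: the one with the highest fuzzy
    membership degree; objects with zero membership to every cluster are outliers."""
    labels = membership.argmax(axis=1)
    labels[membership.max(axis=1) == 0.0] = outlier_label
    return labels
```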
6 Knowledge Extraction from Clustering Results
The last module of the proposed framework is in charge of extracting knowledge from the outcome of the previous clustering step. The activities carried out are cluster labeling and cluster tracking, supported by dedicated sub-modules.
6.1 Cluster Labeling
The procedure proposed for cluster labeling is based on topic modeling, exploiting embeddings for a semantics-aware comparison of possible labels [7]. The labeling of any cluster can be carried out on the basis of the tags associated with its composing articles. As tags are standalone words, no context-sensitive representation is required for them (Sentence-BERT would be unnecessarily complex for this task). Thus, a standard word embedding model has been chosen, specifically fastText [13, 35], which encodes each tag in a 300-dimensional space. An accurate yet flexible labeling can be based on a reference tag dictionary: only the tags in this collection can be used as labels. Such a dictionary must contain a curated set of well-defined relevant tags, along with tags extracted from the most recent articles, which may be able to describe emerging topics. In the approach adopted for the definition and maintenance of the reference tag dictionary, we initially collect the article tags for a given period of time; this set is first filtered according to a minimum frequency threshold on the occurrences of each tag, and subsequently the tags that are not deemed sufficiently informative are removed and placed in a dedicated blacklist. For example, in our case study, “Rome” and other tags concerning locations of the city are not sufficiently informative for our purposes. As new articles arrive, their frequent tags not already present in the blacklist are temporarily added to the dictionary; periodically, a revision of the dictionary is performed to decide whether to definitively include the new tags in the set of relevant ones.
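A minimal sketch of this maintenance step, assuming the dictionary and blacklist are plain sets of strings; the frequency threshold value is hypothetical, as the article does not specify it.

```python
from collections import Counter


def update_reference_dictionary(articles_tags, blacklist, reference_dict,
                                min_freq: int = 10):
    """Count the tags of newly collected articles, keep the sufficiently frequent
    ones, discard blacklisted tags, and temporarily add the rest to the reference
    tag dictionary, pending the periodic revision described above."""
    counts = Counter(tag for tags in articles_tags for tag in tags)
    for tag, freq in counts.items():
        if freq >= min_freq and tag not in blacklist:
            reference_dict.add(tag)
    return reference_dict
```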
Let \(t\) denote a generic time window. For each time window \(t\), the clustering algorithm generates a partition \(P^t = \lbrace C_i^t \rbrace\); the labels to associate with each cluster \(C_i^t\) are generated as follows. First, for each cluster \(C_i^t\) in \(P^t\), we identify the top-\(K\) (\(K=10\) in our experiments) most frequent tags along with their frequencies; tags that are not present in the reference tag dictionary are assigned the N/A label. Second, we obtain the distributed representation of each frequent tag (i.e., its tag embedding) according to the pre-trained fastText model. Third, we compute the centroid embedding \(\overrightarrow{\mathit {WTag}}_{i}^t\) as the component-wise weighted average of the tag embeddings in each cluster \(C_i^t\), where tag embeddings are weighted by their frequency:
\begin{equation} \overrightarrow{\mathit {WTag}}_{i}^{t} = \frac{\sum _{k=1}^{K} \mathit {f}_{i,k} \cdot \overrightarrow{\mathit {tag}}_{i,k}}{\sum _{k=1}^{K} \mathit {f}_{i,k}}, \end{equation}
where \(\overrightarrow{\mathit {tag}}_{i,k}\) is the embedding representation of the \(k\)th most frequent tag in cluster \(C_i^{t}\), \(\mathit {f}_{i,k}\) represents the number of occurrences of that tag, and the summation is component-wise.
Finally, we identify the nearest neighbor of \(\overrightarrow{\mathit {WTag}}_{i}^{t}\) in the reference tag dictionary with respect to cosine similarity. Such a tag gives an indication of the principal topic of the cluster. Optionally, to grasp the nuances of cluster traits, a label may consist of an ordered sequence of tags: in our case study, we will report the three nearest neighbors.
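The sketch below summarizes the labeling procedure; `tag_embedding` is assumed to map a tag to its 300-D fastText vector (e.g., obtained from a pre-trained model), and the handling of out-of-dictionary tags is simplified with respect to the description above.

```python
from collections import Counter

import numpy as np


def label_cluster(cluster_tags, tag_embedding, reference_dict, K=10, n_labels=3):
    """Top-K frequent tags -> frequency-weighted centroid of their embeddings
    (Equation (1)) -> nearest dictionary tags by cosine similarity."""
    top = [(t, f) for t, f in Counter(cluster_tags).most_common(K)
           if t in reference_dict]
    if not top:
        return ["N/A"]
    vecs = np.array([tag_embedding(t) for t, _ in top])
    freqs = np.array([f for _, f in top], dtype=float)
    centroid = (freqs[:, None] * vecs).sum(axis=0) / freqs.sum()

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    scored = sorted(((cos(centroid, tag_embedding(t)), t) for t in reference_dict),
                    reverse=True)
    return [t for _, t in scored[:n_labels]]
```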
6.2 Cluster Tracking
The clustering algorithm computes an output partition from scratch at each time window. Therefore, investigating the evolution of profiles calls for a method to track clusters across consecutive time windows. The proposed method hinges on the definition of three metrics that characterize the purity, coverage, and preservation of clusters. The joint assessment of these metrics allows us to capture salient phenomena concerning the evolution of the partition, including the emergence and merging of clusters, or whether and to what extent a cluster is maintained over time.
In order to evaluate the proposed metrics, we must first devise a procedure to identify the relationship between objects/clusters of two consecutive windows. The procedure is schematically depicted in Figure 3: after the clustering at time \(t\), the objects of the previous window (\(t-1\)) are projected into the current reduced space, determined according to the dimensionality reduction transformation learned on the data of time window \(t\). Then, the membership of each projected object to the clusters in \(P^t\) is evaluated by applying the criteria used in the adopted clustering algorithm for determining whether an object belongs to a cluster. In the following, this procedure is referred to as virtual assignment.
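A sketch of the virtual assignment step under stated assumptions: previous-window embeddings are projected with the UMAP reducer fitted at window \(t\) (`reducer_t.transform`), and each projected object is attached to the cluster of its nearest core point if the distance falls within the threshold; the data structures (`core_points_t`, `core_labels_t`) and the nearest-core-point rule are our own simplification of the algorithm's assignment criteria.

```python
import numpy as np


def virtual_assignment(prev_embeddings, reducer_t, core_points_t, core_labels_t,
                       eps_max: float):
    """Project objects of window t-1 into the reduced space of window t and
    assign each of them to a cluster of P^t (or -1 if none), based on the
    distance from the nearest core point."""
    projected = reducer_t.transform(prev_embeddings)
    assignments = []
    for x in projected:
        d = np.linalg.norm(core_points_t - x, axis=1)
        nearest = int(np.argmin(d))
        assignments.append(core_labels_t[nearest] if d[nearest] <= eps_max else -1)
    return np.array(assignments)
```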
Below, we first introduce the notation and then define the metrics that support cluster tracking across subsequent time windows.
Let \(C_i^{t-1}\) and \(C_j^t\) be two generic clusters in the partitions \(P^{t-1}\) and \(P^t\) obtained for time windows \(t-1\) and \(t\), respectively. Using the notation \(|\cdot |\) for set cardinality, the number of clusters in a generic window \(t\) corresponds to \(|P^t|\). Furthermore, we denote with \(C_{i \rightarrow j}\) the set of objects in \(C_i^{t-1}\) that are virtually assigned to cluster \(C_j^t\) according to the procedure illustrated in Figure 3.
Purity of \(C_j^t\). This measure of disorder can be applied whenever at least one object of a cluster in \(P^{t-1}\) is virtually assigned to \(C_j^t\). If this condition holds, we indicate with \(Q_j^{t-1} =\lbrace k \in P^{t-1} \,|\, C_{k \rightarrow j} \ne \varnothing \rbrace\) the subset of the clusters in partition \(P^{t-1}\) with at least one object virtually assigned to \(C_j^t\). The members of \(Q_j^{t-1}\) are referred to as “contributing clusters”. The index is computed by exploiting the concept of normalized entropy for a cluster \(C_j^t\), defined as
\begin{equation} H_j^t = -\frac{1}{\log |Q_j^{t-1}|} \sum _{i \in Q_j^{t-1}} p_{i \rightarrow j} \log p_{i \rightarrow j}, \end{equation}
where \(p_{i \rightarrow j}\) is the fraction of objects in \(C_{i \rightarrow j}\) w.r.t. all the objects from \(Q_j^{t-1}\) virtually assigned to \(C_j^t\), i.e.,
\begin{equation} p_{i \rightarrow j} = \frac{|C_{i \rightarrow j}|}{\sum _{k \in Q_j^{t-1}} |C_{k \rightarrow j}|}. \end{equation}
The normalized entropy ranges between 0 and 1. Thus, we can define the purity of cluster \(C_j^t\) as
\begin{equation} \mathit {Pur}_j^t = 1 - H_j^t. \end{equation}
A high value of purity for cluster \(C_j^{t}\) occurs whenever the objects from the previous window that are virtually assigned to \(C_j^{t}\) largely originate from one single cluster. Conversely, a provenance evenly spread over multiple clusters yields a low purity value.
Coverage between \(C_i^{t-1}\) and \(C_j^{t}\). Coverage is defined as
\begin{equation} \mathit {Cov}_{i \rightarrow j}^t = \min \left(1, \frac{|C_{i \rightarrow j}|}{|C_j^t|}\right). \end{equation}
The coverage ranges from 0 to 1, with 1 testifying that the number of objects belonging to \(C_{i \rightarrow j}\) is equal to, or higher than, the number of objects in \(C_j^{t}\). Incidentally, the cluster \(l\) in time window \(t-1\) that most contributes to \(C_j^{t}\) can be found as \(l = \underset{i \in P^{t-1}}{\mathrm{argmax}}(\mathit {Cov}_{i \rightarrow j}^t)\).
Preservation of \(C_i^{t-1}\) in \(C_j^{t}\). This metric evaluates the fraction of objects in cluster \(C_i^{t-1}\) that are virtually assigned to cluster \(C_j^{t}\). Preservation is defined as
\begin{equation} \mathit {Pre}_{i \rightarrow j}^t = \frac{|C_{i \rightarrow j}|}{|C_i^{t-1}|}. \end{equation}
The preservation ranges between 0 and 1; it reaches its maximum value when all the objects in cluster \(C_i^{t-1}\) are virtually assigned to the same cluster \(C_j^t\).
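A sketch of how the three metrics can be computed for a cluster \(C_j^t\), given the counts produced by the virtual assignment; the input structures are our own, the formulas follow the definitions above, and the purity of a cluster with a single contributing cluster is conventionally set to 1.

```python
import numpy as np


def tracking_metrics(assigned_counts: dict, size_cj: int, prev_sizes: dict):
    """assigned_counts[i] = |C_{i->j}|, size_cj = |C_j^t|, prev_sizes[i] = |C_i^{t-1}|.
    Returns the purity of C_j^t and the per-cluster coverage and preservation."""
    total = sum(assigned_counts.values())
    if total == 0:
        purity = None                      # no contributing cluster: purity undefined
    elif len(assigned_counts) == 1:
        purity = 1.0
    else:
        p = np.array(list(assigned_counts.values()), dtype=float) / total
        entropy = -(p * np.log(p)).sum() / np.log(len(p))   # normalized entropy
        purity = 1.0 - entropy
    coverage = {i: min(1.0, c / size_cj) for i, c in assigned_counts.items()}
    preservation = {i: c / prev_sizes[i] for i, c in assigned_counts.items()}
    return purity, coverage, preservation
```

For instance, with 597 objects coming from a previous cluster of 916 objects into a current cluster of 521 objects, the function returns a preservation of 597/916 = 0.652 and a coverage of 1, consistently with the example discussed in Section 7.4.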
Given two consecutive partitions \(P^{t-1}\) and \(P^t\), some notable patterns, which we name Continuity, Topic Emergence, Topics Fusion, and Topic Expansion, can be described in terms of the metrics defined above. Formally:
Continuity. Let \(l = \underset{i \in P^{t-1}}{\mathrm{argmax}} (\mathit {Cov}_{i \rightarrow j}^t)\). WHEN \(\mathit {Pur}_{j}^{t}\) is high AND \(\mathit {Cov}_{l \rightarrow j}^t\) is high THEN \(C_j^{t}\) originates from \(C_l^{t-1}\).
Topic Emergence. Let \(l = \underset{i \in P^{t-1}}{\mathrm{argmax}} (\mathit {Cov}_{i \rightarrow j}^t)\). WHEN \(\mathit {Cov}_{l \rightarrow j}^t\) is low THEN \(C_j^{t}\) represents an emerging topic.
Topics Fusion. WHEN \(\sum _{i \in P^{t-1}}(\mathit {Cov}_{i \rightarrow j}^t)\) is high AND \(\sum _{i \in P^{t-1}} (\mathit {Pre}_{i \rightarrow j}^t)\) is high THEN \(C_j^{t}\) is the fusion of different clusters from \(P^{t-1}\).
Topic Expansion. WHEN \(\mathit {Pre}_{i \rightarrow j}^{t}\) is high AND \(\mathit {Cov}_{i \rightarrow j}^{t}\) is low THEN \(C_j^{t}\) is expanding from \(C_i^{t-1}\).
Although other patterns could be envisaged, we identified these as the most significant for our news stream investigations; in the following section, we show how they occur and how they are used in our monitoring campaign. Notably, uncovering the occurrence of one of these patterns is an important step in cluster evolution analysis, but it should always be complemented by information from the labeling of the involved clusters. Furthermore, it is worth underlining that the adjectives high and low provide only a rough indication: automatic pattern matching requires specifying appropriate threshold values, as illustrated in the sketch below.
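A sketch of how such pattern matching could be automated, given the outputs of the metric computation above; the threshold values are hypothetical, since the article only states that "high" and "low" must be made operational.

```python
def match_patterns(purity, coverage, preservation, hi: float = 0.7, lo: float = 0.3):
    """Detect pattern occurrences for a cluster C_j^t.  `coverage` and
    `preservation` are dictionaries keyed by the clusters of P^{t-1};
    `hi` and `lo` are assumed threshold values."""
    patterns = []
    if not coverage:
        return [("Topic Emergence", None)]
    l = max(coverage, key=coverage.get)               # most-contributing cluster
    if purity is not None and purity >= hi and coverage[l] >= hi:
        patterns.append(("Continuity", l))
    if coverage[l] <= lo:
        patterns.append(("Topic Emergence", None))
    if sum(coverage.values()) >= hi and sum(preservation.values()) >= hi:
        patterns.append(("Topics Fusion", None))
    for i in coverage:
        if preservation.get(i, 0.0) >= hi and coverage[i] <= lo:
            patterns.append(("Topic Expansion", i))
    return patterns
```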
7 System Deployment and Clustering Results: A Case Study on the City of Rome
The proposed framework exploits a data processing pipeline that has been developed in Python. To evaluate and demonstrate its usefulness, we set up a case study regarding the city of Rome. In the following, we first describe the experimental setup, with details on the dataset and on the system configuration; subsequently, we report and discuss the results of our experimental campaign.
7.1 Dataset Description
The reference dataset for our investigation is extracted from RomaToday,1 a well-known online newspaper with news regarding exclusively the Italian capital and its direct surroundings. An example of the article data and metadata is reported in Table 2. We collected news in a monitoring campaign from June 2019 to June 2020, roughly centered on the initial spread of Covid-19 in Italy, with the intent to capture its impact on the profiles of the city areas. The dataset contains about 15,000 news articles; Table 3 summarizes its main statistics. Notably, a non-negligible number of articles have no associated tag at all (1,202 out of 15,214), which indicates how a categorization based on other elements of the article, such as the title and the summary, might be important in practice. Figure 4 shows the top-25 most frequent tags over the whole monitored period, providing an indicative picture of the most characterizing topics covered in our dataset. Moreover, the spatial distribution of the 2020 articles is reported as a heatmap in Figure 5, using a grid with 1-km step over the urban area. Unsurprisingly, the highest density of news is found in downtown areas.
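A minimal sketch of how geo-localized articles can be binned into such a grid for the heatmap, using a rough equirectangular approximation; the grid origin and the helper name are our own assumptions.

```python
import numpy as np


def grid_cell(lat: float, lon: float, lat0: float, lon0: float,
              step_km: float = 1.0):
    """Map a geo-localized article to the (row, col) cell of a regular grid with
    1-km step anchored at (lat0, lon0), e.g., the south-west corner of the city."""
    km_per_deg_lat = 110.574
    km_per_deg_lon = 111.320 * np.cos(np.radians(lat0))
    row = int((lat - lat0) * km_per_deg_lat // step_km)
    col = int((lon - lon0) * km_per_deg_lon // step_km)
    return row, col
```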
7.2 Configuration of the Framework Parameters
The numerical representation of news articles is one of the key components of the proposed system. As for the choice of the language model, a variety of pre-trained Sentence-BERT models is available as part of the sentence_transformers library.2 Obviously, the choice should depend on the language used in the target news articles. Our case study concerns Italian newspapers but, unfortunately, no Sentence-BERT model specifically tuned on Italian data is currently available; thus, we chose a multilingual model (i.e., xlm-r-bert-base-nli-stsb-mean-tokens), as it provides reliable out-of-the-box representations of sequences in a variety of languages. We encode the title and the article summary with the Sentence-BERT model and obtain the 768-dimensional article-level embeddings. Conversely, for tag embeddings, the fastText model for Italian has been used in the case study.
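A minimal sketch of this encoding step with the sentence_transformers library; concatenating title and summary into a single input sequence is an assumption on our part, as the article does not detail how the two fields are combined.

```python
from sentence_transformers import SentenceTransformer

# Multilingual Sentence-BERT model used for article-level embeddings.
model = SentenceTransformer("xlm-r-bert-base-nli-stsb-mean-tokens")


def encode_article(title: str, summary: str):
    """Return a 768-dimensional embedding for a news article."""
    return model.encode(f"{title}. {summary}")
```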
The parameters of the algorithms used in the proposed framework have been set according to the indications of the specialized literature. UMAP has several hyper-parameters that affect the dimensionality reduction, the most important ones being the number of neighbors, the distance metric, and the dimensionality of the reduced space. The number of neighbors controls the balance between preserving the global and the local structure of the reduced data space: the larger the value of this parameter, the larger the emphasis on global structure preservation [44]. We executed UMAP with the recommended parameter setting (30 neighbors) to avoid noisy fine-grained clusters. As for the metric, we chose the cosine distance, as it is a standard metric widely exploited in NLP tasks to effectively measure the distance between word- and document-level embeddings. Furthermore, we projected the 768-dimensional Sentence-BERT embeddings into a lower-dimensional feature space to make the downstream cluster analysis more effective; we set the dimensionality of the target space to 5, following the indications of the related literature [7]. The TSF-DBSCAN.news algorithm was executed with \(\mathit {MinPts} = 5\), while the values of the distance thresholds \(\varepsilon _{\mathit {min}}\) and \(\varepsilon _{\mathit {max}}\) were automatically derived at each reclustering step, as described in Section 5. The offline stage of TSF-DBSCAN.news was evaluated with a period of one month. Furthermore, we tuned the forgetting mechanism so that only the news articles collected in the previous five weeks, approximately, were considered as input for each reclustering step. In other words, we allowed an overlap of one week between two consecutive evaluations of the offline step of the clustering algorithm.
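For reference, the configuration described above can be summarized as follows; the dictionary layout itself is only an illustration, while all values are taken from the text.

```python
# Configuration of the case study (Section 7.2).
CONFIG = {
    "umap": {"n_neighbors": 30, "n_components": 5, "metric": "cosine"},
    "tsf_dbscan_news": {"MinPts": 5},   # eps_min / eps_max tuned automatically
    "reclustering_period": "1 month",   # offline stage evaluated monthly
    "sliding_window": "5 weeks",        # forgetting mechanism: ~5 weeks of news
    "window_overlap": "1 week",         # overlap between consecutive evaluations
}
```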
7.3 General Results
Table 4 shows some summary statistics of the clustering results over the various time windows. Specifically, for each window, we report the number of articles, the number of clusters, the maximum and minimum number of articles per cluster, and the number of outliers. We can observe that the number of clusters generated at each time window ranges from a minimum of just three clusters in March 2020 to more than 20 clusters in the final period of our monitoring campaign. Such flexibility of the density-based clustering algorithm with respect to the number of discovered clusters lets us model some interesting patterns; in fact, the topic of the Covid-19 pandemic almost monopolized the online news in March 2020, while a substantial fragmentation of topics characterized the subsequent months.
Notably, the clustering operation also spotted some outliers in several time windows. For instance, in July 2019 four clusters were identified in total, with computed labels referring to the topics “waste-emergency”, “deaths”, “heat”, and “thefts”. The outliers showed no similarity with such clusters, and their tags referred to “cubs”, “scouts”, and “people exploitation”.
Providing an overall effective representation of the clustering results is not trivial. Figure 6 shows a stacked bar plot representing the main clusters discovered in the central period of our analysis, between December 2019 and May 2020: the height of each bar is proportional to the number of objects in the cluster. Furthermore, each cluster bar is labeled with the three nearest neighbors to \(\overrightarrow{\mathit {WTag}}_i^{t}\), as defined in Equation (1), along with their cosine similarity w.r.t. \(\overrightarrow{\mathit {WTag}}_i^{t}\). For visualization purposes only, clusters with fewer than 100 objects are grouped and labeled as others. A visual analysis of the figure indicates the presence of some dominant and recurring topics, with several clusters related to the tags “waste emergency” or “thefts” and “robberies”. This trend was then drastically disrupted by the emergence of the “coronavirus” topic. However, to get a more thorough understanding of the situation and to model the evolution of clusters over time, we resort to the newly defined cluster tracking metrics.
7.4 Metrics and Pattern Evaluation
In this section, we provide some examples of evaluation of the metrics defined in Section 6.2. Table 5 reports the computed values of these metrics for the cluster identified by the tags “waste emergency”, “waste”, and “road accidents” from the January 2020 time window. The high value of Purity (\(Pur = 0.745\)) suggests that most of the objects from the previous time window that are virtually assigned to this cluster come from one single cluster: in our case, 597 out of the 650 objects assigned to the current cluster (more than \(90\%\)) originate from the same cluster. The source cluster, identified by the tags “waste emergency”, “waste”, and “toxic fires”, is fairly preserved in the current cluster (\(Pre = 0.652\)), since 597 out of its 916 objects are mapped therein. In addition, the number of objects projected from such a source cluster (597) is higher than the total number of objects of the current cluster (521): thus, according to Equation (5), we report a Coverage of 1.
Such high values for Coverage and Purity indicate a match for the Continuity pattern: the current cluster originates from a cluster from the previous window. Indeed, they share the tags “waste emergency” and “waste” with high cosine similarity values. Although other clusters from the previous time window also have objects mapped to the cluster under investigation, their Coverage values are very low, thus indicating no significant pattern match.
Another example of the computed metrics is provided in Table 6, related to the cluster labeled by “thefts”, “robberies”, and “car theft”. Also in this case, an occurrence of the Continuity pattern is identified with respect to a cluster from the previous time window that covers similar topics and features exactly the same tags. Although it is evident that most of the objects projected onto the current cluster originate from one single cluster of the previous time window (348 out of 404), the Purity value (\(Pur = 0.419\)) is significantly lower than in the example discussed before. In the following, we empirically show that this value should nevertheless be considered high. Since the objects projected onto the current cluster originate from two clusters only, we computed the purity value for all the possible pairs of integers that sum up to 404. Figure 7 reports the CDF of the resulting purity values: the shape of the plot denotes a strong skewness of their distribution, and therefore the value found for the analyzed cluster, very close to the 75th percentile (0.456), can be considered relatively high.
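This empirical check is easy to reproduce; the short sketch below computes the purity of every pair of positive integers summing to 404 (i.e., two contributing clusters) and reports the 75th percentile of the resulting values.

```python
import numpy as np


def purity_percentile(total: int = 404, q: float = 75.0) -> float:
    """Purity values for all splits of `total` objects between two contributing
    clusters, and the requested percentile of their distribution."""
    purities = []
    for a in range(1, total):
        p = np.array([a, total - a], dtype=float) / total
        entropy = -(p * np.log(p)).sum() / np.log(2)   # normalized (base-2) entropy
        purities.append(1.0 - entropy)
    return float(np.percentile(purities, q))


print(purity_percentile())   # close to the 75th-percentile value (0.456) cited above
```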
Finally, Table 7 reports the metric values for the cluster identified by the tags “coronavirus”, “deaths”, and “meningitis”. In this case, we can observe that very few objects from the earlier window are projected there, leading to a low maximum value of Coverage. Thus, an occurrence of the Topic Emergence pattern is identified: it corresponds, in early 2020, to the first news related to the Coronavirus pandemic.
7.5 Impact of Covid-19 on News Clusters
In order to evaluate the impact of the Coronavirus pandemic and how it is reflected in our data and in the reports for the city of Rome, we recall here several important dates related to the outbreak and evolution of the pandemic. According to the World Health Organization (WHO), on December 31st, 2019, several health authorities around the world contacted WHO seeking additional information about this “viral pneumonia”. However, it was not until January 31st, 2020, that the first two positive cases of Coronavirus in Italy were reported at the Spallanzani hospital in Rome. As time elapsed, more cases were reported, and the first localized lockdown was declared in the Lombardia region on February 23rd. On March 9th, a country-wide lockdown was put into action and kept until May 18th. The bottom part of Figure 8 depicts the popularity of the search query “coronavirus” in Google Search,3 with the mentioned dates clearly marked with red lines. It is quite evident that the peaks in the chart are associated with the events at the marked dates and, in particular, with the worsening of the pandemic situation. In the following, we take an in-depth look at how our system can help understand the evolution of clusters over the period of the pandemic diffusion.
We focus on the three most relevant topics, namely “Waste”, “Theft”, and “Coronavirus”, and identify the clusters whose labels belong to these broad semantic areas. Figure 8 (top) shows how such topics evolve from December 2019 to May 2020 by representing the most relevant clusters and the links between them, along with the values of the metrics introduced in this article. Table 8 reports the labeling of the clusters involved in the analysis.
The first cluster labeled as “Coronavirus”-related (i.e., C8) coincides with the period of the first two positive cases reported in Italy. In the previous section we have shown, by analyzing Table 7, that this cluster represents the emergence of the coronavirus topic. Clearly, as the pandemic evolved, the identified clusters evolved accordingly. For example, clusters C9 and C10 roughly coincide, at least temporally, with the Lombardy and Nation-wide lockdowns, respectively. Furthermore, if we look at their values of Preservation and Coverage, they match the Topic Expansion pattern. Moreover, the clusters related to the “Waste” and “Theft” topics show small variations in their cardinality over the period from December 2019 to February 2020. It is interesting to see how objects belonging to clusters C3 and C6, related to the “Waste” and “Theft” topics, respectively, merge in the subsequent window into cluster C10 (related to the “Coronavirus” topic), with a high value of Preservation. In fact, cluster C10 groups both news directly related to the “Coronavirus” and news concerning other topics within the general pandemic context (e.g., how the pandemic affected garbage collection and led to the appearance of wild animals). In the subsequent window, i.e., in April 2020, cluster C10 was further split into four clusters: cluster C7, related to the “Theft” topic, and clusters C11, C12, and C13, related to the “Coronavirus” topic. By inspecting each cluster and its news, it is clear that all of them are related to the pandemic, but each one provides a different perspective. Specifically, cluster C11 contains news about recently detected coronavirus cases and restrictions to mitigate the increase of new cases; in C12, the articles describe the new delivery options offered by restaurants and supermarkets for their products, and some services offered to doctors and nurses; cluster C13 deals with municipal works on facilities and road improvements aimed at promoting the use of bikes, scooters, and other eco-friendly ways of commuting. Some examples of these news articles are reported in Table 9.
Notably, only the labeling of cluster C13 precisely matches the actual content, with tags like “road works” and “ring road” that relate to the semantic area of urban mobility. In C11 and C12, instead, the secondary tags (“mice” and “meningitis”) are rather distant from the cluster centroid tags in the embedding space (cosine similarity lower than 0.5). This highlights the importance of using an accurate reference tag dictionary, which should cover a range of semantic areas as wide as possible: the current set, for example, does not include terms in the field of food and restaurants, which likely explains the apparently whimsical labeling of cluster C12.
Finally, we observe an occurrence of the Topic Fusion pattern in the last window (May 2020), involving clusters C11, C12, and C14.
7.6 Geospatial Analysis
The proposed approach can take advantage of the geo-localization of news to support a more focused analysis: specific areas of the city can be considered, and the evolution of the news pertinent to such areas can be visualized. Here, we provide an example of this use, visualizing the clusters of news for the same part of the city in two different periods of time. We chose the area around the Termini railway station (the biggest train station in Rome), which is densely populated and was deeply affected by the pandemic. We visualized the news clusters for two time windows, one directly before the enforcement of the Nation-wide lockdown, and the other during the lockdown itself. Figure 9 shows the results: most of the news concerning the city area before the major lockdown, represented by clusters C6 and C3, refers to small crimes (e.g., thefts, robberies, car thefts), which unfortunately often affect areas of this kind in major cities, and to waste problems and road accidents, respectively. Cluster C9 shows up as a minor structure linked to the Coronavirus pandemic, likely due to the initial restrictions enforced in February. Conversely, in the next period, i.e., during the first Nation-wide lockdown, reports of robbery and waste emergency suddenly disappear, and most of the news articles are targeted toward the pandemic.
Whereas previous experiments allowed us to evaluate the method as a whole, narrower analyses like the one reported in this section may help provide a clearer view of how events unfold in specific city areas during specific periods of time.
8 Conclusions and Future Work
In this work, we have proposed an approach that is able, on the basis of an analysis of online news, to monitor and track changes in daily life in different areas of a city, both in normal conditions and in the case of extraordinary events, where changes may be more noticeable and bear more complex consequences. The pipeline used in the supporting framework includes: (i) a Sentence-BERT pre-trained language model to obtain semantically relevant article-level representations that can be compared in terms of proximity by cosine similarity; (ii) a modified version of the TSF-DBSCAN clustering algorithm for grouping articles that cover similar topics; (iii) a method for the automatic labeling of the identified clusters; and (iv) a set of metrics aimed at relating clusters generated in consecutive windows.
We evaluated our approach through an extensive case study on the city of Rome during the Covid-19 pandemic, and specifically over the first Nation-wide lockdown. The experimentation performed with the deployment of our news-based framework and the related methodology for knowledge extraction from discovered clusters can give us a few interesting insights.
First, we have observed that the methodology is effective in isolating the specificity of the pandemic, and the consequent lockdown, by means of the clusters of news in specific moments in time and in specific areas. We also showed how a single neighborhood of the city, and particularly one that was deeply affected by the lockdown, drastically changed in terms of news reported for the area.
Second, we adapted a fuzzy density-based clustering algorithm for data streams, i.e., TSF-DBSCAN, to manage high-dimensional data and to automatically tune its parameters from the actual density of objects. The Dimensionality Reduction sub-module supports the new version of TSF-DBSCAN (i.e., TSF-DBSCAN.news) to properly deal with word and sequence embeddings produced with state-of-the-art Language Models, characterized by a high dimensionality. We have shown that the UMAP algorithm is an effective choice for reducing the number of dimensions of distributed representations of texts, thus making TSF-DBSCAN.news a viable solution for high-dimensional data and NLP applications. Furthermore, the automatic tuning enables TSF-DBSCAN.news to autonomously manage streams of data with possibly changing distributions.
Third, the metrics purposely introduced in this work for monitoring the evolution of clusters over time, especially in this context, help analyze how the whole partition, as well as any single cluster, changes over time. The three metrics, namely Purity, Coverage, and Preservation, used either singularly or in conjunction with one another, allow uncovering evolutionary relationships across clusters in adjacent time windows, thus providing the ability to track how the profiles of the city areas change over time.
Finally, it is worth underlining a major advantage of our approach: no supervised models are used in the overall pipeline; this allows us to avoid the costly and time-consuming labeling of single pieces of news as pertaining to certain topics. In the future, we plan to extend the proposed framework to make it operate in distributed environments (e.g., with nodes dedicated to specific tasks and/or specific cities). Moreover, we aim at studying how to improve the labeling technique as well, in order to automatically identify highly informative tags across time. In addition to this, the obtained results could pave the way to proficiently support different kinds of applications that may benefit from the outcome of city area profiling, thus enabling the integration of an additional layer of information.
In conclusion, the proposed method can be an effective tool for monitoring changes in the life of a given city by the plain use of newspaper information readily available on the web. Moreover, the framework turns out to be effective also in understanding how changes take place during crisis times like the global Covid-19 pandemic and how such changes impact life in the city, and it can potentially provide insights to organizations, public or otherwise, to better cope with such hard times and to react properly and effectively.