TCMeta: a multilingual dataset of COVID tweets for relation-level metaphor analysis

Mojca Brglez^1,2,
Omnia Zayed² &
Paul Buitelaar²

Abstract

The COVID pandemic spurred the use of various metaphors, some very common and universal, others depending on the language, country and culture. The use of metaphors by the general public, especially in languages other than English, has not yet been sufficiently investigated, one of the reasons being the lack of resources and automatic tools for metaphor analysis. To fill this gap, we introduce TCMeta, a dataset of tweets annotated for metaphors around COVID-19, in two languages from ten different countries. The dataset contains metaphoric phrases covering four source domains. Furthermore, we introduce a semi-automatic methodology to annotate more than 2000 tweets in English and Slovene. To the best of our knowledge, this is the first multilingual semi-automatically compiled dataset of user-generated texts aimed at investigating metaphorical language about the pandemic. It is also the first Slovene dataset of tweets annotated for metaphors.

1 Introduction

The COVID pandemic has been a pervasive issue since 2020, featuring as an almost inescapable topic in various types of discourse. Emerging in late 2019, the new virus had spread globally by early 2020 and drastically changed all aspects of our lives, e.g. how we conduct ourselves, how we dress, how we socialize. Furthermore, it changed healthcare, politics, the economy, education, the work environment, and communication. The virus’ unknown origins, mechanisms, means of spreading, short- and long-term effects on health and other potential impacts superimposed on us a complex and obscure topic replete with many uncertainties. To make sense of it, we have, among other coping mechanisms or adaptive strategies, frequently resorted to metaphors, much like we do when faced with other abstract, obscure and/or complex concepts. According to Conceptual Metaphor Theory (CMT, Johnson 1987; Lakoff & Johnson 1980) and its many ensuing developments (e.g. Semino, 2016; Gentner et al. 2001; Gibbs, 1994, 2017; Kövecses, 2005; Kövecses, 2020; Steen, 2011), metaphors are not only figures of speech on the linguistic level but can act as powerful tools for various communicative and/or cognitive goals. They enable us to present a certain topic (domain) that may be difficult to understand by making an analogy to a more understandable, known, familiar topic (domain). They also allow us to implicitly express emotions and opinions through the metaphor’s implications, e.g. one could just succinctly say COVID wave and mean a lot of COVID-infections that are uncontrollable, hard to treat and constrain, and as such undesirable.

Because COVID was a global phenomenon that trickled into people’s personal lives, investigating their (changing) attitudes and reactions to the arrival of the virus, its development, and the measures taken against it would give us insight into how people experienced the situation and could also guide health, media and political communication in the future. The COVID pandemic has, unsurprisingly, been investigated in numerous social media studies relating to the spread of information and misinformation, sentiment analysis, human mobility tracking and other topics (a comprehensive review of studies, datasets and unresolved issues is provided by Huang et al., 2022). Because metaphors are such a powerful tool for representation and understanding, they are also a worthwhile avenue for research. However, apart from Wicke and Bolognesi (2020, 2021) and Abdo et al. (2020) for English, posts by the general public on social media have not yet been sufficiently analysed for COVID-related metaphors, especially in less-resourced languages such as Slovene.

To shed light on the metaphorical language of social media users, and, foremost, to investigate potential differences between users of different languages and countries, our work focuses on the language in Twitter communication. The dataset consists of more than 2,000 tweets in Slovene and English, which are complemented with more than 4,000 annotations. The choice to use Twitter data was guided by various reasons. Twitter is a frequent choice among researchers, primarily because of the ease-of-access to data. Although Facebook is the most popular platform with the highest number of users globally,^{Footnote 1} the access to the information through their application programming interface (API)^{Footnote 2} is much more restricted. Access to Twitter data, on the other hand, is made easy and straightforward through a Twitter API^{Footnote 3} that allows developers and researchers to retrieve tweets with all associated metadata. Secondly, as a microblogging service, it is used primarily as a textual medium, whereas others may feature and encourage more visual content such as videos and images. The posts (tweets) are limited in length (now allowing up to 280 characters) which makes the posts brief and comparable with each other. For this study, we applied for Academic access which allows for fetching up to 10 million tweets. Nevertheless, Twitter’s Terms and Conditions stipulate that data collected through the API can only be redistributed in the form of Tweet IDs.

2 Background

Metaphor has traditionally been recognized as a figure of speech on the level of language. In that capacity, metaphor equals (re-)naming, transferring the name of one thing to another thing on the basis of some similarity. In the contemporary view based on Conceptual Metaphor Theory (CMT; Lakoff & Johnson, 1980; 2003), metaphor is seen not just as a linguistic, but as a cognitive phenomenon. CMT posits that metaphorical expressions (on the language level) stem from certain recurring patterns, called conceptual metaphors. Metaphor in CMT is defined as a primarily cognitive device that maps concepts and structures from a source domain, which is typically more concrete and familiar, to a target domain, which is typically more abstract and unfamiliar. In this way, it allows us to understand one domain of experience in terms of another. Metaphor as a cognitive phenomenon manifests itself in linguistic metaphors (or in other forms such as visual metaphors; Steen, 2018). Thus, metaphorical expressions are regarded as surface manifestations of underlying conceptual metaphors. For example, the following sentences contain metaphorical expressions that can be regarded as stemming from the conceptual metaphor LOVE is a JOURNEY^{Footnote 4}:

They are at a crossroads in their relationship.
This relationship isn’t going anywhere.
They’re in a dead-end relationship.^{Footnote 5}

Our basic experiences of journeys and trips allow us to easily understand the situations referred to by the examples. We realize that we have to decide where to turn or how to proceed at a crossroads, we know progression happens by going somewhere, and we know that there is no way forward in a dead-end street.

CMT has been an extremely influential view that boosted metaphor research in the past few decades. However, the latter still faces a lot of challenges. For one, the study of metaphors in language has for a long time lacked a robust identification procedure. For English, a group of researchers developed MIP (Metaphor Identification Procedure; Pragglejaz Group, 2007), which later evolved into the more detailed MIPVU (MIP Vrije Universiteit; Steen, 2010). However, such identification approaches require a lot of time and effort from annotators, as they involve reading the whole text, separating it into lexical units and only then deciding for each of the units whether it is metaphorical or not. To reduce the effort of manual annotation, researchers can apply more targeted approaches from corpus linguistics that involve searching for only a specified set of words in the corpus, or use automatic computational methods. Another open problem, at the core of metaphor analysis, is the ascription of source and target domains that form conceptual metaphors. The term “domain” is not clearly defined by the originators of CMT; in cognitive linguistics, however, it is defined as “a coherent area of conceptualization relative to which semantic units may be characterized” (Langacker, 1987, p. 488). The concept can seem quite similar to the concept of lexical field. However, as Cameron (2003) notes, contrary to the latter, domains are “not just a collection of concepts or entities” but also encompass the various meaningful relations between the entities. That is, while lexical fields group words and phrases on a linguistic, lexical level, the concepts the words evoke are grouped and interconnected on a much richer conceptual and cognitive level. 
It is often unclear at which precise level to formulate these domains that construe conceptual metaphors (Cameron, 2003; Kövecses, 2017; Kövecses, 2020), and some metaphor researchers may instead use other, more specific conceptual constructs such as mental space, scene, frame, script, schema etc. In this study, the metaphorical analysis is made on the general level of domains as defined by Kövecses (2020). That is, we selected broad conceptual domains, i.e. WAR, STORM, TSUNAMI and MONSTER, and captured them via the proxy of a lexical field, a group of lexical units that evoke those domains. At present, we do not try to identify or distinguish between particular frames or other conceptual constructions that may be instantiated via metaphors, which involve conceptually richer information, with specific roles and relations.

3 Related work

In this section, we present previous work that relates to three main aspects of our study: (1) metaphor identification, (2) existing metaphor datasets, and (3) studies of COVID-related metaphors in particular.

3.1 Metaphor identification approaches

Linguistic and conceptual metaphor identification approaches in non-annotated corpora vary in their methods and scope. First, we can differentiate between what Brdar et al. (2020) call ‘census’ and ‘sampling’ approaches. The first take a bottom-up approach, starting from text, and identifying metaphorically used words, either manually or automatically. A completely manual approach, such as the MIPVU procedure (Steen, 2010), involves careful reading of texts in their entirety, separating each text into lexical units, and deciding for each unit if it is used metaphorically or not. This approach is only possible for smaller corpora or by enlisting a large number of annotators. The second, sampling approaches, adopt a top-down perspective. Here some sort of filtering is applied to texts, either by looking for examples based on metaphorical signals,^{Footnote 6}(Goatly, 1997) or by limiting the search to selected conceptual metaphors (or domains) (Stefanowitsch, 2006), and supplementing the results with manual annotation. It can involve searching for source domain vocabulary, searching for target domain vocabulary, or searching for sentences (or other units) containing lexical items from both target and source domains. The latter is considered to provide a good balance between coverage, accuracy, time and effort compared to other manual or semi-automatic approaches (Stefanowitsch, 2006, p. 4).
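As a rough illustration of the last option, the following sketch flags sentences that contain lexical items from both a source and a target domain, so that only those candidates need manual annotation. The seed lists, function names and example sentences are hypothetical and chosen for illustration only; they are not the lexicons used in any of the cited studies, and the sketch matches surface word forms without lemmatization.

```python
import re

# Hypothetical seed lexicons (illustrative only, not taken from any cited study).
SOURCE_TERMS = {"war", "battle", "enemy", "storm", "tsunami", "monster"}
TARGET_TERMS = {"covid", "coronavirus", "virus", "pandemic"}

# Lowercase alphabetic tokens only, so "COVID-19" yields the token "covid".
TOKEN_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Split a sentence into lowercase alphabetic tokens."""
    return TOKEN_RE.findall(text.lower())


def domain_hits(sentence):
    """Return the source-domain and target-domain terms found in a sentence."""
    tokens = set(tokenize(sentence))
    return tokens & SOURCE_TERMS, tokens & TARGET_TERMS


def is_candidate(sentence):
    """A sentence is a candidate only if both domains are lexically present."""
    source, target = domain_hits(sentence)
    return bool(source) and bool(target)


# Invented example sentences.
sentences = [
    "We are fighting a war against the virus.",
    "The storm damaged the harbour last night.",
    "A tsunami of COVID cases hit the hospitals.",
    "The pandemic changed how we work.",
]

# Only the first and third sentences mention both a source and a target term.
candidates = [s for s in sentences if is_candidate(s)]
for sentence in candidates:
    print(sentence)
```

A filter of this kind only narrows the search space: whether a flagged expression is actually metaphorical in context still has to be decided by an annotator.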

From a computational perspective, many efforts have been made, especially in English, to develop methods to identify metaphors (or figurative language in general) through more automatic means. Extensive reviews of metaphor processing are provided in Shutova (2011) and Rai and Chakraverty (2020). Earlier approaches include those using hand-coded knowledge (Fass, 1991), language resources (Gedigian et al., 2006; Krishnakumaran & Zhu, 2007), psycholinguistic features such as abstractness of words (Turney et al. 2011), similarity- or relatedness-based clustering (Birke & Sarkar, 2006; Shutova et al. 2010), and topic modelling (Broadwell et al., 2013; Heintz et al., 2013; Strzalkowski et al., 2013). From the development of deep learning with neural networks, the field of automatic metaphor identification has shifted its focus to supervised methods that involve training neural models on metaphor-annotated datasets (e.g. Choi et al., 2021; Do Dinh & Gurevych, 2016; Haagsma & Bjerva, 2016; Liu et al. 2020; Rai et al. 2016; Zayed et al. 2020b). Computational approaches to metaphor identification also differ in their level of processing, which can be carried out on the level of words, relations, specific constructions, or sentences. In the first case, metaphoricity is ascribed to individual words, so the task usually involves labelling every token in the text (as in e.g. Choi et al., 2021; Do Dinh & Gurevych, 2016). In relation-level approaches, groups of syntactically related words are considered, usually containing expressions from both source and target domains. Most approaches, such as those by Shutova et al. (2010) and Shutova et al. (2016) tackle VERB-NOUN relations where the verb is metaphorical, and others such as Tsvetkov et al. (2014), Turney et al. (2011), Gutiérrez et al. (2016), Bizzoni et al. (2017) focus on ADJ-NOUN relations where the adjective is metaphorical. Some address both relation types (Rei et al. 2017; Zayed et al. 2018; Zayed et al., 2020b). 
However, there are also other common constructional patterns identified in corpus studies (Sullivan, 2013), including copula constructions (NOUN is NOUN, e.g. COVID is war), prepositional constructions (NOUN of NOUN, e.g. wave of poverty), domain constructions where the noun is metaphorical (ADJ NOUN, e.g. political monster). These have attracted only a few computational endeavours (Dodge et al. 2015; Krishnakumaran & Zhu, 2007; Rai & Chakraverty, 2017) despite the usefulness of constructions in determining conceptual domains (Sullivan, 2013).

While the field of figurative language processing has made considerable progress in English and other well-resourced languages, low-resourced languages such as Slovene unfortunately lag far behind. We are aware of only a few (semi-)automatic approaches. Although not specifically addressing metaphors, Škvorc et al. (2021) construct MICE, a neural model trained to discern figurative from literal usage of idiomatic phrases. More recently, Zwitter et al. (2022) investigate adapting the MICE model by transfer learning and use it to identify sentences containing metaphors in a corpus of migration-related news. In a semi-automatic approach, Brglez et al. (2021) looked for COVID is WAR metaphors in news discourse by extending the lexical field of WAR using word embeddings, thus capturing a wider set of items coming from the source domain. Computational metaphor processing in Slovene is thus still in its early stages, one reason being the lack of linguistic resources. In the next section, we discuss the availability of datasets in both English and Slovene.

3.2 Metaphor datasets

There is only a small number of metaphor datasets available that can be used either for large-scale linguistic analysis or to train deep-learning-based models for automatic metaphor identification. The subsections below describe the existing English and Slovene metaphor datasets.

3.2.1 English datasets

The largest and most widely used corpus, especially for metaphor identification, is the Vrije Universiteit Amsterdam Metaphor Corpus (VUAMC; Steen, 2010). It comprises 190,000 words of English text from four registers and identifies linguistic metaphors on the word level for various part-of-speech types (verbs, adjectives, nouns, adverbs, and prepositions). However, it has certain limitations relating to metaphor analysis as it deals with word-level metaphors only. As opposed to relation-level approaches (Zayed et al., 2018) that try to capture both the source domain and target domain in one phrase or syntactic relation, VUAMC only contains annotations for metaphoric expressions of the source domain. It does not relate them to their possible referents (expressions of the target domain) and is also not annotated with conceptual domains. A few English corpora or datasets are exceptions in that they do account for phrase-level metaphors and/or conceptual domains. One group of research outputs includes the five studies (Dodge et al., 2015; Gordon et al., 2015; Levin et al., 2014; Mohler et al., 2016; Shaikh et al., 2014) under the umbrella of an IARPA project that focus on metaphors related to societal issues and governance. Levin et al. (2014) create lists of conventional conceptual metaphors from previous literature and research on metaphor, in which they enumerate various syntactic patterns and lexical markers. This allows them to identify around 7500 English sentences (but also Russian, Spanish, Farsi). Mohler et al. (2016) introduce the LCC datasets, compiled either manually or automatically, which focus on relation-level metaphors. Their approach focuses on so-called metaphoric constructions, which are syntactically related terms within a sentence that could relate to a source and a target domain. 
For 80,100 such pairs, they provide metaphoricity ratings, polarity and intensity ratings as well as domain mappings for approximately 20,000 metaphoric pairs in English, but also Spanish, Russian and Farsi. The free dataset is reduced to around 9,000 annotated pairs (available upon request). Similarly, Gordon et al. (2015) design an annotation scheme and annotate around 1,500 sentences in detail for ontological categories, frames, frame elements, and affective polarity by combining manual and automatic methods. Shaikh et al. (2014) are one of the rare computational approaches that deal with metaphor in a larger context than a sentence. Based on the selected target topic of interest (such as Democracy), they identify around 189,862 relevant passages (English) and assign metaphoricity and affect ratings to verbs, adjectives and nouns in the context window with the use of topic modelling, dependency parsing, corpus analysis, WordNet and conceptual resources. They also assign various proto-source domains to the metaphors found. Another large resource stemming from the same line of research (although it is not a text corpus that can be used as a dataset) is the MetaNet Wiki repository (Dodge et al., 2015), which was also constructed based on known conceptual metaphors, on the basis of lexical sets and syntactic constructions. A separate endeavour to annotate metaphors with conceptual domains is Shutova and Teufel (2010), where they ascribe source and target domains to verbs in 761 sentences coming from various domains and genres (BNC), altogether 164 verb metaphors.

There are other approaches that try to capture syntactic constructions and deal with metaphor on the level of relations, which would allow easier identification of source and target domain terms. However, the studies below do not (yet) try to assign conceptual metaphors. This line of research usually focuses on one to three constructions, the most common are VERB-NOUN and ADJECTIVE-NOUN constructions, in which the verb or the adjective is metaphorical and the noun acts as the target domain referent. Turney et al. (2011), Tsvetkov et al. (2014), and Gutiérrez et al. (2016) collect adjective-noun constructions in which the adjective is literal or metaphoric. The sets include 1768 and 8592 adjective-noun pairs, respectively. Shutova (2010) constructs a small set of 62 verbs with verb-subject and verb-object relations. To develop automatic metaphor identification, Shutova et al. (2016) adapt the MOH metaphor corpus (Mohammad et al. 2016) to MOH-X with explicit relations of metaphoric verbs to either a subject or a direct object, resulting in 647 verb-noun pairs out of which 316 are labelled metaphorical. Another more recent dataset, also going into the domain of user-generated texts, is Zayed’s tweet dataset (Zayed et al., 2019). It contains around 2,500 tweets with metaphoric verbs paired with their object. Some studies have also adapted the previous existing datasets to fit phrase-level approaches. Parde and Nielsen (2018) created a dataset of phrase-level metaphors sampled from the VUA corpus to provide novelty annotations. It contains around 18,000 metaphoric word pairs (containing V, ADJ, ADV, N, and PP metaphors). Zayed et al. (2020a) extend the dataset by Tsvetkov et al. (2014) with 1,800 tweets to provide context for the original ADJ-N metaphor pairs, and determine the subject or object relation for metaphoric verbs in 6000 sentences taken from VUAMC and TroFi (Birke & Sarkar, 2007).

Another subdomain of metaphor datasets concerns user-generated content on social media. Apart from Zayed et al. (2019) and Zayed et al. (2020a), few other social media datasets exist that involve annotations for metaphors or other types of figurative language in computer-mediated user discourse. The dataset by Ghosh et al. (2015), Li et al. (2014) was constructed for a SemEval2015 task and is split into various figurative categories based on the hashtags of the tweets and expansion with LSA. Of these, 2000 tweets are labelled as metaphoric. Jang et al. (2014) annotate posts from an online breast cancer support group, a forum for gang members and a forum for online course participants (altogether 314 sentences). The work was continued by Jang et al. (2015), which resulted in around 2500 annotated posts with literal and metaphoric uses of 7 selected words. In a more specific vein of research, Yadav et al. (2020) construct a dataset of 3738 depressive tweets, where they primarily annotate examples with depression symptoms but also sarcasm and metaphor.

3.2.2 Slovene metaphor datasets

Slovene, a language spoken by approximately 2 million people, is a less-resourced language, reflected also in its availability of metaphor datasets. Currently, only two metaphor corpora have been published, one was released in 2020 and another one in 2022. The KOMET corpus (Antloga, 2020a) was developed to parallel the effort of the VUAMC corpus (Steen, 2010) in English, and is thus similar in size, genre makeup and annotation schema: it contains around 200,000 words coming from journalistic, fiction and on-line texts. In addition, it also contains semantic/conceptual annotations and a separate label for metonymy. The metaphorically used words are (for the most part) annotated with one of 67 semantic frames. However, the corpus is annotated on a word-level, meaning no connection is made between the expression of the source domain (metaphor) and its target (the expression the metaphor refers to). Another corpus of metaphors released only recently, the G-KOMET corpus (Antloga & Donaj, 2022), is an upgrade of KOMET, as it extends the genre coverage to include spoken texts. Similar to the VUAMC, both of these were designed as a general corpus not specific to a particular topic, and thus allow metaphor analysis on a broader level.

3.3 Research relating to metaphors on COVID

The COVID pandemic has been a difficult, continuous and ever-evolving issue. From a point of view of linguistic and social studies, such events often produce interesting metaphors which give insight into how such situations are experienced and understood. A very common and conventional conceptual metaphor for disease-related events is ILLNESS IS WAR, attested in linguistic studies on Zika (Ribeiro et al. 2018), SARS (Chiang & Duann, 2007; Ibrahim, 2007; Wallis & Nerlich, 2005), AIDS and cancer (Sontag, 1977). The WAR domain is nowadays frequently used for a wide array of topics (Flusberg, Matlock, & Thibodeau, 2018), such as politics, sports, and societal issues. The metaphorical framing of the COVID pandemic and its various developments has also already elicited a lot of linguistic studies: in media (e.g. Brglez et al., 2021; Busso & Tordini, in press; Fernández-Pedemonte et al. 2020; Kalinin, 2021; Zhang et al. 2022), political discourse and health communication (e.g. Castro Seixas, 2021; Charteris-Black, 2021; Papamanoli & Kaniklidou, 2022), children’s books (Muelas-Gil, 2022), scientific articles (Dar, 2021), and, to an extent, also user-generated content such as Twitter (Abdo et al., 2020; Wicke & Bolognesi, 2020, 2021). The studies mostly focus on and reaffirm the predominance of the conceptual metaphor ILLNESS IS WAR. On the other hand, some linguists and social scientists show that many other alternative frames are possible, if not also more suitable (Hanne, 2022; Olza et al. 2021; Pérez-Sobrino et al. 2022; Semino, 2021; Wicke & Bolognesi, 2020).

A large proportion of these studies have investigated communication about COVID by politicians or the media, while less attention has been paid to the linguistic expression and understanding of COVID by the general public, i.e. in user-generated content on social media, with few exceptions. Colak (2022) asked Turkish users of Facebook, Instagram and Twitter to provide a post completing the prompt “COVID-19 is like _ because _”. They collected 125 responses and 84 valid metaphors covering a wide array of source domains. COVID was most frequently presented as an unwanted relative, love, an ex-partner, gossip, and cancer. The authors note that they did not observe the metaphor most frequently used by media and politicians at that time, which was “war” or “struggle”. Using the same prompt, Gök Uslu and Kara (2022) collected 210 responses from Turkish participants and identified 43 different metaphors with a wide array of domains. Among the 7 subcategories of metaphors, the most frequent one frames the virus as something deadly/dangerous. They also observe differences in the use of particular frames depending on gender and medical history.

In large-scale social media studies that indirectly collected data, Abdo et al. (2020) analyse 14 days of data from Twitter using keywords such as “Corona”, “Coronavirus”, “COVID-19” and their synonyms at the start of the pandemic. Part of their study is also to detect metaphors by comparing the lexis of tweets to the MetaNet repository (Dodge et al., 2015) of known conceptual metaphors. The most frequently detected metaphor is DISEASE TREATMENT IS WAR. However, the focus of the paper is not on metaphors, so the metaphor identification procedure is not clearly explained. Moreover, the authors do not investigate other, non-conventionalized metaphors. Wicke and Bolognesi (2020) collect English tweets published in a 14-day period in March and April 2020 using a set of COVID-related hashtags. To balance the corpus and make it more representative of the general population, they retain only the first tweet of a user per day. Using LDA topic modelling, they determine the most prevalent themes in the corpus. By compiling a list of 91 war-related words, they identify around 5.0% of tweets that contain WAR framing. By classifying tweets into the LDA-discovered topics, they conclude that the WAR frame is mostly used to talk about the treatment and proposed measures, but lacks presence in topics that refer to more social or personal aspects. They also explore the use of the alternative frames STORM, MONSTER and TSUNAMI but find these are much less frequent in their data. However, their study only looks at the “surface” layer of words, collecting frequency data of potentially metaphorical seed words but not deliberating on their actual metaphoricity in context.

In a subsequent study, Wicke and Bolognesi (2021) include a somewhat larger time span of tweets, namely from March 20 to July 1, 2020. They investigate the temporal change of topics related to COVID-19, the sentiment, subjectivity and figurative framing using the WAR frame. In the part focused on figurative framing, they analyse the frequency of war-related lexis overall and in three intervals. Their study finds that the distribution is not constant and that the use of the WAR frame slowly diminishes over time. On the other hand, in the last interval, they also see a rise of specific war-related words but determine that these are mostly used literally, relating to real-world violent events in the US.

A limitation of most studies of metaphors in COVID discourse so far, including in particular the discourse on Twitter, is that they are limited to the initial phases of the pandemic, i.e. based on data produced in 2020. In our study, we are also interested in the overall development of metaphors through time (or at least a wider time frame), and expect to see metaphors evolving, dying, adapting, emerging, becoming more or less popular etc. Studies have often also been limited to just one language and, more often than not, to only one country. As the studies show, metaphor frequency and the selection of particular conceptual domains can depend on several factors, from the time period relative to the course of events, the type of discourse, and individual personal factors such as gender and medical history, to country and culture. For instance, the use of metaphors for the SARS epidemic was different in the countries where the epidemic had stronger effects than in those that experienced it from a distance (Wallis & Nerlich, 2005). It has also been shown that certain frames are generally more likeable than others, so the frames used for COVID differ depending on the specific country context (Brugman et al., 2022). War-related metaphors were more or less avoided in Germany (Jaworska, 2020; Paulus, 2020) as well as in New Zealand, where the government communication relied more on the frames of LEVEL, BUBBLE and TEAM (Kearns, 2021). Our study takes inspiration from the work by Wicke and Bolognesi (2020, 2021), who took a quantitative approach and used semi-supervised methods to analyse the use and pervasiveness of different conceptual domains in framing the COVID pandemic on Twitter. 
We complement past findings by overcoming some of the limitations of previous studies: the advantages of our approach are multilinguality and, especially, the inclusion of Slovene as a less-resourced language, the coverage of more conceptual domains, the focus on user-generated content, a wider time span of data, and the distinction between different English-speaking countries. Additionally, our relation-level approach focuses on metaphorical expressions where the metaphor is conveyed by an adjective or a noun. Although prepositions and verbs are the parts of speech responsible for the largest portion of metaphors according to corpus studies (Antloga, 2020b; Cameron, 2003; Krennmayr & Steen, 2017), they have also been found to be less novel, i.e. more conventional (Do Dinh et al. 2018), and less deliberate (Reijnierse et al. 2019).

4 Methodology

In this section, we describe the methodology of compiling the dataset.

Sections 4.1 and 4.2 describe the collection and filtering of tweets for English and Slovene, respectively. In Sect. 4.3, we describe the various issues related to data normalization and cleaning. In Sect. 4.4, we address linguistic processing including tokenization, sentence segmentation as well as lemmatization and part-of-speech tagging with automatic linguistic pipelines. Finally, in Sect. 4.5, we describe our approach of extracting and annotating metaphoric expressions from the dataset.

4.1 English data collection

We employ the publicly available GeoCOV19Tweets Dataset (Lamsal, 2020a; 2021), a filtered sample of the original, larger COV19Tweets Dataset (Lamsal, 2020a; 2021) that retains only tweets with geolocation information. It consists of IDs of tweets that contained any of the COVID-19-related keywords or hashtags upon their publication on the Twitter platform, starting from March 19, 2020. The initial set of keywords contained the words “corona, #corona, coronavirus, #coronavirus” but was later expanded to include 46 different keywords or hashtags. The cut-off point for our dataset is January 31, 2022. We use the Hydrator (2020) tool to hydrate the tweets via the Twitter API, that is, to retrieve the full text of each tweet and all its associated metadata, including user ID, geolocation, information on retweets and likes, time of creation etc. Out of the 463,903 IDs provided, 401,452 (86.54%) could be retrieved; the rest had already been removed.

In the next step, we process the dataset to ensure approximately equal contributions by different users, following Wicke and Bolognesi (2021) in balancing more productive and less productive users. We discard retweets and keep only one tweet per user per day according to the user ID information provided with each tweet. We then divide the English dataset by country, as identified by the tweet metadata. There are more than 200 countries present in the dataset, but the vast majority of countries contribute less than 1% of the total number of tweets. The distribution of countries is depicted in Fig. 1.
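The balancing step can be sketched as follows. This is a minimal illustration rather than the pipeline used for the dataset: the field names (user_id, created_at, is_retweet, country) are hypothetical simplifications of the tweet metadata, the example records are invented, and keeping the earliest tweet of each user per calendar day is an assumption, since only the one-tweet-per-user-per-day rule is stated above.

```python
from collections import Counter
from datetime import datetime


def balance_by_user_day(tweets):
    """Drop retweets and keep one tweet per (user, calendar day).

    Assumption: the earliest tweet of the day is the one that is kept.
    """
    kept = {}
    for tweet in sorted(tweets, key=lambda t: t["created_at"]):
        if tweet["is_retweet"]:
            continue
        key = (tweet["user_id"], tweet["created_at"].date())
        # setdefault stores only the first (earliest) tweet seen for this key.
        kept.setdefault(key, tweet)
    return list(kept.values())


def country_shares(tweets):
    """Return each country's share of the given tweets as a fraction of 1."""
    counts = Counter(t["country"] for t in tweets)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}


# Invented example records with hypothetical field names.
tweets = [
    {"id": 1, "user_id": "u1", "created_at": datetime(2020, 3, 20, 9, 0),
     "is_retweet": False, "country": "US"},
    {"id": 2, "user_id": "u1", "created_at": datetime(2020, 3, 20, 18, 30),
     "is_retweet": False, "country": "US"},
    {"id": 3, "user_id": "u1", "created_at": datetime(2020, 3, 21, 8, 0),
     "is_retweet": False, "country": "US"},
    {"id": 4, "user_id": "u2", "created_at": datetime(2020, 3, 20, 10, 0),
     "is_retweet": True, "country": "IE"},
    {"id": 5, "user_id": "u2", "created_at": datetime(2020, 3, 20, 11, 0),
     "is_retweet": False, "country": "IE"},
]

balanced = balance_by_user_day(tweets)
shares = country_shares(balanced)
print([t["id"] for t in balanced])
print(shares)
```

Computing the country shares after balancing, as done here, is a design choice of the sketch; shares computed before balancing would give more weight to highly productive users.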

Table 1 English subdataset sizes
