Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

Atul Kr. Ojha, A. Seza Doğruöz, Harish Tayyar Madabushi, Giovanni Da San Martino, Sara Rosenthal, Aiala Rosá (Editors)

Anthology ID:: 2024.semeval-1
Month:: June
Year:: 2024
Address:: Mexico City, Mexico
Venue:: SemEval
SIG:: SIGLEX
Publisher:: Association for Computational Linguistics
URL:: https://aclanthology.org/2024.semeval-1
DOI:
Bib Export formats:: BibTeX MODS XML EndNote
PDF:: https://aclanthology.org/2024.semeval-1.pdf

pdf bib abs
CUNLP at SemEval-2024 Task 8: Classify Human and AI Generated Text
Pranjal Aggarwal | Deepanshu Sachdeva

This task is a sub-part of SemEval-2024 competition which aims to classify AI vs Human Generated Text. In this paper we have experimented on an approach to automatically classify an artificially generated text and a human written text. With the advent of generative models like GPT-3.5 and GPT-4 it has become increasingly necessary to classify between the two texts due to various applications like detecting plagiarism and in tasks like fake news detection that can heavily impact real world problems, for instance stock manipulation through AI generated news articles. To achieve this, we start by using some basic models like Logistic Regression and move our way up to more complex models like transformers and GPTs for classification. This is a binary classification task where the label 1 represents AI generated text and 0 represents human generated text. The dataset was given in JSON style format which was converted to comma separated file (CSV) for better processing using the pandas library in Python as CSV files provides more readability than JSON format files. Approaches like Bagging Classifier and Voting classifier were also used.

In this system paper for SemEval-2024 Task 1 subtask A, we present our approach to evaluating the semantic relatedness of sentence pairs in nine languages. We use a mix of statistical methods combined with fine-tuned BERT transformer models for English and use the same model and machine-translated data for the other languages. This simplistic approach shows consistently reliable scores and achieves above-average rank in all languages.

pdf bib abs
L3i++ at SemEval-2024 Task 8: Can Fine-tuned Large Language Model Detect Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text?
Hanh Thi Hong Tran | Tien Nam Nguyen | Antoine Doucet | Senja Pollak

This paper summarizes our participation in SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. In this task, we aim to solve two over three Subtasks: (1) Monolingual and Multilingual Binary Human-Written vs. Machine-Generated Text Classification; and (2) Multi-Way Machine-Generated Text Classification. We conducted a comprehensive comparative study across three methodological groups: Five metric-based models (Log-Likelihood, Rank, Log-Rank, Entropy, and MFDMetric), two fine-tuned sequence-labeling language models (RoBERTA and XLM-R); and a fine-tuned large-scale language model (LS-LLaMA). Our findings suggest that our LLM outperformed both traditional sequence-labeling LM benchmarks and metric-based approaches. Furthermore, our fine-tuned classifier excelled in detecting machine-generated multilingual texts and accurately classifying machine-generated texts within a specific category, (e.g., ChatGPT, bloomz, dolly). However, they do exhibit challenges in detecting them in other categories (e.g., cohere, and davinci). This is due to potential overlap in the distribution of the metric among various LLMs. Overall, we achieved a 6th rank in both Multilingual Binary Human-Written vs. Machine-Generated Text Classification and Multi-Way Machine-Generated Text Classification on the leaderboard.

pdf bib abs
nicolay-r at SemEval-2024 Task 3: Using Flan-T5 for Reasoning Emotion Cause in Conversations with Chain-of-Thought on Emotion States
Nicolay Rusnachenko | Huizhi Liang

Emotion expression is one of the essential traits of conversations. It may be self-related or caused by another speaker. The variety of reasons may serve as a source of the further emotion causes: conversation history, speaker’s emotional state, etc. Inspired by the most recent advances in Chain-of-Thought, in this work, we exploit the existing three-hop reasoning approach (THOR) to perform large language model instruction-tuning for answering: emotion states (THOR-state), and emotion caused by one speaker to the other (THOR-cause). We equip THORcause with the reasoning revision (RR) for devising a reasoning path in fine-tuning. In particular, we rely on the annotated speaker emotion states to revise reasoning path. Our final submission, based on Flan-T5-base (250M) and the rule-based span correction technique, preliminary tuned with THOR-state and fine-tuned with THOR-cause-rr on competition training data, results in 3rd and 4th places (F1-proportional) and 5th place (F1-strict) among 15 participating teams. Our THOR implementation fork is publicly available: https://github.com/nicolay-r/THOR-ECAC

pdf bib abs
StFX-NLP at SemEval-2024 Task 9: BRAINTEASER: Three Unsupervised Riddle-Solvers
Ethan Heavey | James Hughes | Milton King

In this paper, we explore three unsupervised learning models that we applied to Task 9: BRAINTEASER of SemEval 2024. Two of these models incorporate word sense disambiguation and part-of-speech tagging, specifically leveraging SensEmBERT and the Stanford log-linear part-of-speech tagger. Our third model relies on a more traditional language modelling approach. The best performing model, a bag-of-words model leveraging word sense disambiguation and part-of-speech tagging, secured the 10th spot out of 11 places on both the sentence puzzle and word puzzle subtasks.

pdf bib abs
hinoki at SemEval-2024 Task 7: Numeral-Aware Headline Generation (English)
Hinoki Crum | Steven Bethard

Numerical reasoning is challenging even for large pre-trained language models. We show that while T5 models are capable of generating relevant headlines with proper numerical values, they can also make mistakes in reading comprehension and miscalculate numerical values. To overcome these issues, we propose a two-step training process: first train models to read text and generate formal representations of calculations, then train models to read calculations and generate numerical values. On the SemEval 2024 Task 7 headline fill-in-the-blank task, our two-stage Flan-T5-based approach achieved 88% accuracy. On the headline generation task, our T5-based approach achieved RougeL of 0.390, BERT F1 Score of 0.453, and MoverScore of 0.587.

pdf bib abs
T5-Medical at SemEval-2024 Task 2: Using T5 Medical Embedding for Natural Language Inference on Clinical Trial Data
Marco Siino

In this work, we address the challenge of identifying the inference relation between a plain language statement and Clinical Trial Reports (CTRs) by using a T5-large model embedding. The task, hosted at SemEval-2024, involves the use of the NLI4CT dataset. Each instance in the dataset has one or two CTRs, along with an annotation from domain experts, a section marker, a statement, and an entailment/contradiction label. The goal is to determine if a statement entails or contradicts the given information within a trial description. Our submission consists of a T5-large model pre-trained on the medical domain. Then the pre-trained model embedding output provides the embedding representation of the text. Eventually, after a fine-tuning phase, the provided embeddings are used to determine the CTRs’ and the statements’ cosine similarity to perform the classification. On the official test set, our submitted approach is able to reach an F1 score of 0.63, and a faithfulness and consistency score of 0.30 and 0.50 respectively.

pdf bib abs
CTYUN-AI at SemEval-2024 Task 7: Boosting Numerical Understanding with Limited Data Through Effective Data Alignment
Yuming Fan | Dongming Yang | Xu He

Large language models (LLMs) have demonstrated remarkable capabilities in pushing the boundaries of natural language understanding. Nevertheless, the majority of existing open-source LLMs still fall short of meeting satisfactory standards when it comes to addressing numerical problems, especially as the enhancement of their numerical capabilities heavily relies on extensive data.To bridge the gap, we aim to improve the numerical understanding of LLMs by means of efficient data alignment, utilizing only a limited amount of necessary data.Specifically, we first use a data discovery strategy to obtain the most effective portion of numerical data from large datasets. Then, self-augmentation is performed to maximize the potential of the training data. Thirdly, answers of all traning samples are aligned based on some simple rules. Finally, our method achieves the first place in the competition, offering new insights and methodologies for numerical understanding research in LLMs.

pdf bib abs
McRock at SemEval-2024 Task 4: Mistral 7B for Multilingual Detection of Persuasion Techniques In Memes
Marco Siino

One of the most widely used content types in internet misinformation campaigns is memes. Since they can readily reach a big number of users on social media sites, they are most successful there. Memes used in a disinformation campaign include a variety of rhetorical and psychological strategies, including smearing, name-calling, and causal oversimplification, to achieve their goal of influencing the users. The shared task’s objective is to develop models for recognizing these strategies solely in a meme’s textual content (Subtask 1) and in a multimodal context where both the textual and visual material must be analysed simultaneously (Subtasks two and three). In this paper, we discuss the application of a Mistral 7B model to address the Subtask one in English. Find the persuasive strategy that a meme employs from a hierarchy of twenty based just on its “textual content.” Only a portion of the reward is awarded if the technique’s ancestor node is chosen. This classification issue is multilabel hierarchical. Our approach based on the use of a Mistral 7B model obtains a Hierarchical F1 of 0.42 a Hierarchical Precision of 0.30 and a Hierarchical Recall of 0.71. Our selected approach is able to outperform the baseline provided for the competition.

pdf bib abs
Mashee at SemEval-2024 Task 8: The Impact of Samples Quality on the Performance of In-Context Learning for Machine Text Classification
Areeg Fahad Rasheed | M. Zarkoosh

Within few-shot learning, in-context learning(ICL) has become a potential method for lever-aging contextual information to improve modelperformance on small amounts of data or inresource-constrained environments where train-ing models on large datasets is prohibitive.However, the quality of the selected samplein a few shots severely limits the usefulnessof ICL. The primary goal of this paper is toenhance the performance of evaluation metricsfor in-context learning by selecting high-qualitysamples in few-shot learning scenarios. We em-ploy the chi-square test to identify high-qualitysamples and compare the results with those ob-tained using low-quality samples. Our findingsdemonstrate that utilizing high-quality samplesleads to improved performance with respect toall evaluated metrics.

pdf bib abs
Puer at SemEval-2024 Task 4: Fine-tuning Pre-trained Language Models for Meme Persuasion Technique Detection
Jiaxu Dao | Zhuoying Li | Youbang Su | Wensheng Gong

The paper summarizes our research on multilingual detection of persuasion techniques in memes for the SemEval-2024 Task 4. Our work focused on English-Subtask 1, implemented based on a roberta-large pre-trained model provided by the transforms tool that was fine-tuned into a corpus of social media posts. Our method significantly outperforms the officially released baseline method, and ranked 7th in English-Subtask 1 for the test set. This paper also compares the performances of different deep learning model architectures, such as BERT, ALBERT, and XLM-RoBERTa, on multilingual detection of persuasion techniques in memes. The experimental source code covered in the paper will later be sourced from Github.

pdf bib abs
Puer at SemEval-2024 Task 2: A BioLinkBERT Approach to Biomedical Natural Language Inference
Jiaxu Dao | Zhuoying Li | Xiuzhong Tang | Xiaoli Lan | Junde Wang

This paper delineates our investigation into the application of BioLinkBERT for enhancing clinical trials, presented at SemEval-2024 Task 2. Centering on the medical biomedical NLI task, our approach utilized the BioLinkBERT-large model, refined with a pioneering mixed loss function that amalgamates contrastive learning and cross-entropy loss. This methodology demonstrably surpassed the established benchmark, securing an impressive F1 score of 0.72 and positioning our work prominently in the field. Additionally, we conducted a comparative analysis of various deep learning architectures, including BERT, ALBERT, and XLM-RoBERTa, within the context of medical text mining. The findings not only showcase our method’s superior performance but also chart a course for future research in biomedical data processing. Our experiment source code is available on GitHub at: https://github.com/daojiaxu/semeval2024_task2.

pdf bib abs
NRK at SemEval-2024 Task 1: Semantic Textual Relatedness through Domain Adaptation and Ensemble Learning on BERT-based models
Nguyen Tuan Kiet | Dang Van Thin

This paper describes the system of the team NRK for Task A in the SemEval-2024 Task 1: Semantic Textual Relatedness (STR). We focus on exploring the performance of ensemble architectures based on the voting technique and different pre-trained transformer-based language models, including the multilingual and monolingual BERTology models. The experimental results show that our system has achieved competitive performance in some languages in Track A: Supervised, where our submissions rank in the Top 3 and Top 4 for Algerian Arabic and Amharic languages. Our source code is released on the GitHub site.

pdf bib abs
BrainLlama at SemEval-2024 Task 6: Prompting Llama to detect hallucinations and related observable overgeneration mistakes
Marco Siino

Participants in the SemEval-2024 Task 6 were tasked with executing binary classification aimed at discerning instances of fluent overgeneration hallucinations across two distinct setups: the model-aware and model-agnostic tracks. That is, participants must detect grammatically sound outputs which contain incorrect or unsupported semantic information, regardless of whether they had access to the model responsible for producing the output or not, within the model-aware and model-agnostic tracks. Two tracks were proposed for the task: a model-aware track, where organizers provided a checkpoint to a model publicly available on HuggingFace for every data point considered, and a model-agnostic track where the organizers do not. In this paper, we discuss the application of a Llama model to address both the tracks. Find the persuasive strategy that a meme employs from a hierarchy of twenty based just on its “textual content.” Only a portion of the reward is awarded if the technique’s ancestor node is chosen. This classification issue is multilabel hierarchical. Our approach reaches an accuracy of 0.62 on the agnostic track and of 0.67 on the aware track.

Safe and reliable natural language inference is critical for extracting insights from clinical trial reports but poses challenges due to biases in large pre-trained language models. This paper presents a novel data augmentation technique to improve model robustness for biomedical natural language inference in clinical trials. By generating synthetic examples through semantic perturbations and domain-specific vocabulary replacement and adding a new task for numerical and quantitative reasoning, we introduce greater diversity and reduce shortcut learning. Our approach, combined with multi-task learning and the DeBERTa architecture, achieved significant performance gains on the NLI4CT 2024 benchmark compared to the original language models. Ablation studies validate the contribution of each augmentation method in improving robustness. Our best-performing model ranked 12th in terms of faithfulness and 8th in terms of consistency, respectively, out of the 32 participants.

pdf bib abs
SATLab at SemEval-2024 Task 1: A Fully Instance-Specific Approach for Semantic Textual Relatedness Prediction
Yves Bestgen

This paper presents the SATLab participation in SemEval 2024 Task 1 on Semantic Textual Relatedness. The proposed system predicts semantic relatedness by means of the Euclidean distance between the character ngram frequencies in the two sentences to evaluate. It employs no external resources, nor information from other instances present in the material. The system performs well, coming first in five of the twelve languages. However, there is little difference between the best systems.

pdf bib abs
Genaios at SemEval-2024 Task 8: Detecting Machine-Generated Text by Mixing Language Model Probabilistic Features
Areg Mikael Sarvazyan | José Ángel González | Marc Franco-salvador

This paper describes the participation of the Genaios team in the monolingual track of Subtask A at SemEval-2024 Task 8. Our best system, LLMixtic, is a Transformer Encoder that mixes token-level probabilistic features extracted from four LLaMA-2 models. We obtained the best results in the official ranking (96.88% accuracy), showing a false positive ratio of 4.38% and a false negative ratio of 1.97% on the test set. We further study LLMixtic through ablation, probabilistic, and attention analyses, finding that (i) performance improves as more LLMs and probabilistic features are included, (ii) LLMixtic puts most attention on the features of the last tokens, (iii) it fails on samples where human text probabilities become consistently higher than for generated text, and (iv) LLMixtic’s false negatives exhibit a bias towards text with newlines.

pdf bib abs
Self-StrAE at SemEval-2024 Task 1: Making Self-Structuring AutoEncoders Learn More With Less
Mattia Opper | Siddharth Narayanaswamy

We present two simple improvements to the Self-Structuring AutoEncoder (Self-StrAE). Firstly, we show that including reconstruction to the vocabulary as an auxiliary objective improves representation quality. Secondly, we demonstrate that increasing the number of independent channels leads to significant improvements in embedding quality, while simultaneously reducing the number of parameters. Surprisingly, we demonstrate that this trend can be followed to the extreme, even to point of reducing the total number of non-embedding parameters to seven. Our system can be pre-trained from scratch with as little as 10M tokens of input data, and proves effective across English, Spanish and Afrikaans.

pdf bib abs
RGAT at SemEval-2024 Task 2: Biomedical Natural Language Inference using Graph Attention Network
Abir Chakraborty

In this work, we (team RGAT) describe our approaches for the SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials (NLI4CT). The objective of this task is multi-evidence natural language inference based on different sections of clinical trial reports. We have explored various approaches, (a) dependency tree of the input query as additional features in a Graph Attention Network (GAT) along with the token and parts-of-speech features, (b) sequence-to-sequence approach using various models and synthetic data and finally, (c) in-context learning using large language models (LLMs) like GPT-4. Amongs these three approaches the best result is obtained from the LLM with 0.76 F1-score (the highest being 0.78), 0.86 in faithfulness and 0.74 in consistence.

This paper outlines our multimodal ensemble learning system for identifying persuasion techniques in memes. We contribute an approach which utilises the novel inclusion of consistent named visual entities extracted using Google Vision’s API as an external knowledge source, joined to our multimodal ensemble via late fusion. As well as detailing our experiments in ensemble combinations, fusion methods and data augmentation, we explore the impact of including external data and summarise post-evaluation improvements to our architecture based on analysis of the task results.

pdf bib abs
nowhash at SemEval-2024 Task 4: Exploiting Fusion of Transformers for Detecting Persuasion Techniques in Multilingual Memes
Abu Nowhash Chowdhury | Michal Ptaszynski

Nowadays, memes are considered one of the most prominent forms of medium to disseminate information on social media. Memes are typically constructed in multilingual settings using visuals with texts. Sometimes people use memes to influence mass audiences through rhetorical and psychological techniques, such as causal oversimplification, name-calling, and smear. It is a challenging task to identify those techniques considering memes’ multimodal characteristics. To address these challenges, SemEval-2024 Task 4 introduced a shared task focusing on detecting persuasion techniques in multilingual memes. This paper presents our participation in subtasks 1 and 2(b). We use a finetuned language-agnostic BERT sentence embedding (LaBSE) model to extract effective contextual features from meme text to address the challenge of identifying persuasion techniques in subtask 1. For subtask 2(b), We finetune the vision transformer and XLM-RoBERTa to extract effective contextual information from meme image and text data. Finally, we unify those features and employ a single feed-forward linear layer on top to obtain the prediction label. Experimental results on the SemEval 2024 Task 4 benchmark dataset manifested the potency of our proposed methods for subtasks 1 and 2(b).

pdf bib abs
HalluSafe at SemEval-2024 Task 6: An NLI-based Approach to Make LLMs Safer by Better Detecting Hallucinations and Overgeneration Mistakes
Zahra Rahimi | Hamidreza Amirzadeh | Alireza Sohrabi | Zeinab Taghavi | Hossein Sameti

The advancement of large language models (LLMs), their ability to produce eloquent and fluent content, and their vast knowledge have resulted in their usage in various tasks and applications. Despite generating fluent content, this content can contain fabricated or false information. This problem is known as hallucination and has reduced the confidence in the output of LLMs. In this work, we have used Natural Language Inference to train classifiers for hallucination detection to tackle SemEval-2024 Task 6-SHROOM (Mickus et al., 2024) which is defined in three sub-tasks: Paraphrase Generation, Machine Translation, and Definition Modeling. We have also conducted experiments on LLMs to evaluate their ability to detect hallucinated outputs. We have achieved 75.93% and 78.33% accuracy for the modelaware and model-agnostic tracks, respectively. The shared links of our models and the codes are available on GitHub.

pdf bib abs
NIMZ at SemEval-2024 Task 9: Evaluating Methods in Solving Brainteasers Defying Commonsense
Zahra Rahimi | Mohammad Moein Shirzady | Zeinab Taghavi | Hossein Sameti

The goal and dream of the artificial intelligence field have long been the development of intelligent systems or agents that mimic human behavior and thinking. Creativity is an essential trait in humans that is closely related to lateral thinking. The remarkable advancements in Language Models have led to extensive research on question-answering and explicit and implicit reasoning involving vertical thinking. However, there is an increasing need to shift focus towards research and development of models that can think laterally. One must step outside the traditional frame of commonsense concepts in lateral thinking to conclude. Task 9 of SemEval-2024 is Brainteaser (Jiang et al.,2024), which requires lateral thinking to answer riddle-like multiple-choice questions. In our study, we assessed the performance of various models for the Brainteaser task. We achieved an overall accuracy of 75% for the Sentence Puzzle subtask and 66.7% for the Word Puzzle subtask. All the codes, along with the links to our saved models, are available on our GitHub.

pdf bib abs
Mistral at SemEval-2024 Task 5: Mistral 7B for argument reasoning in Civil Procedure
Marco Siino

At the SemEval-2024 Task 5, the organizers introduce a novel natural language processing (NLP) challenge and dataset within the realm of the United States civil procedure. Each datum within the dataset comprises a comprehensive overview of a legal case, a specific inquiry associated with it, and a potential argument in support of a solution, supplemented with an in-depth rationale elucidating the applicability of the argument within the given context. Derived from a text designed for legal education purposes, this dataset presents a multifaceted benchmarking task for contemporary legal language models. Our manuscript delineates the approach we adopted for participation in this competition. Specifically, we detail the use of a Mistral 7B model to answer the question provided. Our only and best submission reach an F1-score equal to 0.5597 and an Accuracy of 0.5714, outperforming the baseline provided for the task.

SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse Large Language Models (LLMs) in various languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtask A & B. To tackle this task, this paper proposes two methods: 1) using traditional machine learning (ML) with natural language preprocessing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. For fine-tuning, we use the train datasets provided by the task organizers. The results show that transformer models like LoRA-RoBERTa and XLM-RoBERTa outperform traditional ML models, particularly in multilingual subtasks. However, traditional ML models performed better than transformer models for the monolingual task, demonstrating the importance of considering the specific characteristics of each subtask when selecting an appropriate approach.

pdf bib abs
iML at SemEval-2024 Task 2: Safe Biomedical Natural Language Interference for Clinical Trials with LLM Based Ensemble Inferencing
Abbas Akkasi | Adnan Khan | Mai A. Shaaban | Majid Komeili | Mohammad Yaqub

We engaged in the shared task 2 at SenEval-2024, employing a diverse set of solutions with a particular emphasis on leveraging a Large Language Model (LLM) based zero-shot inference approach to address the challenge.

pdf bib abs
CLaC at SemEval-2024 Task 4: Decoding Persuasion in Memes – An Ensemble of Language Models with Paraphrase Augmentation
Kota Shamanth Ramanath Nayak | Leila Kosseim

This paper describes our approach to SemEval-2024 Task 4 subtask 1, focusing on hierarchical multi-label detection of persuasion techniques in meme texts. Our approach was based on fine-tuning individual language models (BERT, XLM-RoBERTa, and mBERT) and leveraging a mean-based ensemble model. Additional strategies included dataset augmentation through the TC dataset and paraphrase generation as well as the fine-tuning of individual classification thresholds for each class. During testing, our system outperformed the baseline in all languages except for Arabic, where no significant improvement was reached. Analysis of the results seem to indicate that our dataset augmentation strategy and per-class threshold fine-tuning may have introduced noise and exacerbated the dataset imbalance.

pdf bib abs
RDproj at SemEval-2024 Task 4: An Ensemble Learning Approach for Multilingual Detection of Persuasion Techniques in Memes
Yuhang Zhu

This paper introduces our bagging-based ensemble learning approach for the SemEval-2024 Task 4 Subtask 1, focusing on multilingual persuasion detection within meme texts. This task aims to identify persuasion techniques employed within meme texts, which is a hierarchical multilabel classification task. The given text may apply multiple techniques, and persuasion techniques have a hierarchical structure. However, only a few prior persuasion detection systems have utilized the hierarchical structure of persuasion techniques. In that case, we designed a multilingual bagging-based ensemble approach, incorporating a soft voting ensemble strategy to effectively exploit persuasion techniques’ hierarchical structure. Our methodology achieved the second position in Bulgarian and North Macedonian, third in Arabic, and eleventh in English.

Semantic Text Relatedness (STR), a measure of meaning similarity between text elements, has become a key focus in the field of Natural Language Processing (NLP). We describe SemEval-2024 task 1 on Semantic Textual Relatedness featuring three tracks: supervised learning, unsupervised learning and cross-lingual learning across African and Asian languages including Afrikaans, Algerian Arabic, Amharic, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. Our goal is to analyse the semantic representation of sentences textual relatedness trained on mBert, all-MiniLM-L6-v2 and Bert-Based-uncased. The effectiveness of these models is evaluated using the Spearman Correlation metric, which assesses the strength of the relationship between paired data. The finding reveals the viability of transformer models in multilingual STR tasks.

pdf bib abs
SCaLAR NITK at SemEval-2024 Task 5: Towards Unsupervised Question Answering system with Multi-level Summarization for Legal Text
Manvith Prabhu | Haricharana Srinivasa | Anand Kumar

This paper summarizes Team SCaLAR’s work on SemEval-2024 Task 5: Legal Argument Reasoning in Civil Procedure. To address this Binary Classification task, which was daunting due to the complexity of the Legal Texts involved, we propose a simple yet novel similarity and distance-based unsupervised approach to generate labels. Further, we explore the Multi-level fusion of Legal-Bert embeddings using ensemble features, including CNN, GRU, and LSTM. To address the lengthy nature of Legal explanation in the dataset, we introduce T5-based segment-wise summarization, which successfully retained crucial information, enhancing the model’s performance. Our unsupervised system witnessed a 20-point increase in macro F1-score on the development set and a 10-point increase on the test set, which is promising given its uncomplicated architecture.

pdf bib abs
Abdelhak at SemEval-2024 Task 9: Decoding Brainteasers, The Efficacy of Dedicated Models Versus ChatGPT
Abdelhak Kelious | Mounir Okirim

This study introduces a dedicated model aimed at solving the BRAINTEASER task 9 , a novel challenge designed to assess models’ lateral thinking capabilities through sentence and word puzzles. Our model demonstrates remarkable efficacy, securing Rank 1 in sentence puzzle solving during the test phase with an overall score of 0.98. Additionally, we explore the comparative performance of ChatGPT, specifically analyzing how variations in temperature settings affect its ability to engage in lateral thinking and problem-solving. Our findings indicate a notable performance disparity between the dedicated model and ChatGPT, underscoring the potential of specialized approaches in enhancing creative reasoning in AI.

pdf bib abs
OUNLP at SemEval-2024 Task 9: Retrieval-Augmented Generation for Solving Brain Teasers with LLMs
Vineet Saravanan | Steven Wilson

The advancement of natural language processing has given rise to a variety of large language models (LLMs) with capabilities extending into the realm of complex problem-solving, including brainteasers that challenge not only linguistic fluency but also logical reasoning. This paper documents our submission to the SemEval 2024 Brainteaser task, in which we investigate the performance of state-of-the-art LLMs, such as GPT-3.5, GPT-4, and the Gemini model, on a diverse set of brainteasers using prompt engineering as a tool to enhance the models’ problem-solving abilities. We experimented with a series of structured prompts ranging from basic to those integrating task descriptions and explanations. Through a comparative analysis, we sought to determine which combinations of model and prompt yielded the highest accuracy in solving these puzzles. Our findings provide a snapshot of the current landscape of AI problem-solving and highlight the nuanced nature of LLM performance, influenced by both the complexity of the tasks and the sophistication of the prompts employed.

pdf bib abs
NLP-LISAC at SemEval-2024 Task 1: Transformer-based approaches for Determining Semantic Textual Relatedness
Abdessamad Benlahbib | Anass Fahfouh | Hamza Alami | Achraf Boumhidi

This paper presents our system and findings for SemEval 2024 Task 1 Track A Supervised Semantic Textual Relatedness. The main objective of this task was to detect the degree of semantic relatedness between pairs of sentences. Our submitted models (ranked 6/24 in Algerian Arabic, 7/25 in Spanish, 12/23 in Moroccan Arabic, and 13/36 in English) consist of various transformer-based models including MARBERT-V2, mDeBERTa-V3-Base, DarijaBERT, and DeBERTa-V3-Large, fine-tuned using different loss functions including Huber Loss, Mean Absolute Error, and Mean Squared Error.

pdf bib abs
ZXQ at SemEval-2024 Task 7: Fine-tuning GPT-3.5-Turbo for Numerical Reasoning
Zhen Qian | Xiaofei Xu | Xiuzhen Zhang

In this paper, we present our system for the SemEval-2024 Task 7, i.e., NumEval subtask 3: Numericial Reasoning. Given a news article and its headline, the numerical reasoning task involves creating a system to compute the intentionally excluded number within the news headline. We propose a fine-tuned GPT-3.5-turbo model, specifically engineered to deduce missing numerals directly from the content of news article. The model is trained with a human-engineered prompt that itegrates the news content and the masked headline, tailoring its accuracy for the designated task. It achieves an accuracy of 0.94 on the test data and secures the second position in the official leaderboard. An examination on the system’s inference results reveals its commendable accuracy in identifying correct numerals when they can be directly “copied” from the articles. However, the error rates increase when it comes to some ambiguous operations such as rounding.

pdf bib abs
BAMO at SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense
Baktash Ansari | Mohammadmostafa Rostamkhani | Sauleh Eetemadi

This paper outlines our approach to SemEval 2024 Task 9, BRAINTEASER: A Novel Task Defying Common Sense. The task aims to evaluate the ability of language models to think creatively. The dataset comprises multi-choice questions that challenge models to think ‘outside of the box’. We fine-tune 2 models, BERT and RoBERTa Large. Next, we employ a Chain of Thought (CoT) zero-shot prompting approach with 6 large language models, such as GPT-3.5, Mixtral, and Llama2. Finally, we utilize ReConcile, a technique that employs a ‘round table conference’ approach with multiple agents for zero-shot learning, to generate consensus answers among 3 selected language models. Our best method achieves an overall accuracy of 85 percent on the sentence puzzles subtask.

pdf bib abs
yangqi at SemEval-2024 Task 9: Simulate Human Thinking by Large Language Model for Lateral Thinking Challenges
Qi Yang | Jingjie Zeng | Liang Yang | Hongfei Lin

This paper describes our system used in the SemEval-2024 Task 9 on two sub-tasks, BRAINTEASER: A Novel Task Defying Common Sense. In this work, we developed a system SHTL, which means simulate human thinking capabilities by Large Language Model (LLM). Our approach bifurcates into two main components: Common Sense Reasoning and Rationalize Defying Common Sense. To mitigate the hallucinations of LLM, we implemented a strategy that combines Retrieval-augmented Generation (RAG) with the the Self-Adaptive In-Context Learning (SAICL), thereby sufficiently leveraging the powerful language ability of LLM. The effectiveness of our method has been validated by its performance on the test set, with an average performance on two subtasks that is 30.1 higher than ChatGPT setting zero-shot and only 0.8 lower than that of humans.

pdf bib abs
BadRock at SemEval-2024 Task 8: DistilBERT to Detect Multigenerator, Multidomain and Multilingual Black-Box Machine-Generated Text
Marco Siino

The rise of Large Language Models (LLMs) has brought about a notable shift, rendering them increasingly ubiquitous and readily accessible. This accessibility has precipitated a surge in machine-generated content across diverse platforms encompassing news outlets, social media platforms, question-answering forums, educational platforms, and even academic domains. Recent iterations of LLMs, exemplified by entities like ChatGPT and GPT-4, exhibit a remarkable ability to produce coherent and contextually relevant responses across a broad spectrum of user inquiries. The fluidity and sophistication of these generated texts position LLMs as compelling candidates for substituting human labor in numerous applications. Nevertheless, this proliferation of machine-generated content has raised apprehensions regarding potential misuse, including the dissemination of misinformation and disruption of educational ecosystems. Given that humans marginally outperform random chance in discerning between machine-generated and human-authored text, there arises a pressing imperative to develop automated systems capable of accurately distinguishing machine-generated text. This pursuit is driven by the overarching objective of curbing the potential misuse of machine-generated content. Our manuscript delineates the approach we adopted for participation in this competition. Specifically, we detail the use of a DistilBERT model for classifying each sample in the test set provided. Our submission is able to reach an accuracy equal to 0.754 in place of the worst result obtained at the competition that is equal to 0.231.

pdf bib abs
WarwickNLP at SemEval-2024 Task 1: Low-Rank Cross-Encoders for Efficient Semantic Textual Relatedness
Fahad Ebrahim | Mike Joy

This work participates in SemEval 2024 Task 1 on Semantic Textural Relatedness (STR) in Track A (supervised regression) in two languages, English and Moroccan Arabic. The task consists of providing a score of how two sentences relate to each other. The system developed in this work leveraged a cross-encoder with a merged fine-tuned Low-Rank Adapter (LoRA). The system was ranked eighth in English with a Spearman coefficient of 0.842, while Moroccan Arabic was ranked seventh with a score of 0.816. Moreover, various experiments were conducted to see the impact of different models and adapters on the performance and accuracy of the system.

pdf bib abs
NU-RU at SemEval-2024 Task 6: Hallucination and Related Observable Overgeneration Mistake Detection Using Hypothesis-Target Similarity and SelfCheckGPT
Thanet Markchom | Subin Jung | Huizhi Liang

One of the key challenges in Natural Language Generation (NLG) is “hallucination,” in which the generated output appears fluent and grammatically sound but may contain incorrect information. To address this challenge, “SemEval-2024 Task 6 - SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes” is introduced. This task focuses on detecting overgeneration hallucinations in texts generated from Large Language Models for various NLG tasks. To tackle this task, this paper proposes two methods: (1) hypothesis-target similarity, which measures text similarity between a generated text (hypothesis) and an intended reference text (target), and (2) a SelfCheckGPT-based method to assess hallucinations via predefined prompts designed for different NLG tasks. Experiments were conducted on the dataset provided in this task. The results show that both of the proposed methods can effectively detect hallucinations in LLM-generated texts with a possibility for improvement.

pdf bib abs
NCL_NLP at SemEval-2024 Task 7: CoT-NumHG: A CoT-Based SFT Training Strategy with Large Language Models for Number-Focused Headline Generation
Junzhe Zhao | Yingxi Wang | Huizhi Liang | Nicolay Rusnachenko

Headline Generation is an essential task in Natural Language Processing (NLP), where models often exhibit limited ability to accurately interpret numerals, leading to inaccuracies in generated headlines. This paper introduces CoT-NumHG, a training strategy leveraging the Chain of Thought (CoT) paradigm for Supervised Fine-Tuning (SFT) of large language models. This approach is aimed at enhancing numeral perception, interpretability, accuracy, and the generation of structured outputs. Presented in SemEval-2024 Task 7 (task 3): Numeral-Aware Headline Generation (English), this challenge is divided into two specific subtasks. The first subtask focuses on numerical reasoning, requiring models to precisely calculate and fill in the missing numbers in news headlines, while the second subtask targets the generation of complete headlines. Utilizing the same training strategy across both subtasks, this study primarily explores the first subtask as a demonstration of our training strategy. Through this competition, our CoT-NumHG-Mistral-7B model attained an accuracy rate of 94%, underscoring the effectiveness of our proposed strategy.

pdf bib abs
Byun at SemEval-2024 Task 6: Text Classification on Hallucinating Text with Simple Data Augmentation
Cheolyeon Byun

This paper aims to classify sentences to see if it is hallucinating, meaning the generative language model has output text that has very little to do with the user’s input, or not. This classification task is part of the Semeval 2024’s task on Hallucinations and Related Observable Over-generation Mistakes, AKA SHROOM, which aims to improve awkward-sounding texts generated by AI. This paper will first go over the first attempt at creating predictions, then show the actual scores achieved after submitting the first attempt results to Semeval, then finally go over potential improvements to be made.

pdf bib abs
DeepPavlov at SemEval-2024 Task 6: Detection of Hallucinations and Overgeneration Mistakes with an Ensemble of Transformer-based Models
Ivan Maksimov | Vasily Konovalov | Andrei Glinskii

The inclination of large language models (LLMs) to produce mistaken assertions, known as hallucinations, can be problematic. These hallucinations could potentially be harmful since sporadic factual inaccuracies within the generated text might be concealed by the overall coherence of the content, making it immensely challenging for users to identify them. The goal of the SHROOM shared-task is to detect grammatically sound outputs that contain incorrect or unsupported semantic information. Although there are a lot of existing hallucination detectors in generated AI content, we found out that pretrained Natural Language Inference (NLI) models yet exhibit success in detecting hallucinations. Moreover their ensemble outperforms more complicated models.

pdf bib abs
HIJLI_JU at SemEval-2024 Task 7: Enhancing Quantitative Question Answering Using Fine-tuned BERT Models
Partha Sengupta | Sandip Sarkar | Dipankar Das

In data and numerical analysis, Quantitative Question Answering (QQA) becomes a crucial instrument that provides deep insights for analyzing large datasets and helps make well-informed decisions in industries such as finance, healthcare, and business. This paper explores the “HIJLI_JU” team’s involvement in NumEval Task 1 within SemEval 2024, with a particular emphasis on quantitative comprehension. Specifically, our method addresses numerical complexities by fine-tuning a BERT model for sophisticated multiple-choice question answering, leveraging the Hugging Face ecosystem. The effectiveness of our QQA model is assessed using a variety of metrics, with an emphasis on the f1_score() of the scikit-learn library. Thorough analysis of the macro-F1, micro-F1, weighted-F1, average, and binary-F1 scores yields detailed insights into the model’s performance in a range of question formats.

pdf bib abs
NCL Team at SemEval-2024 Task 3: Fusing Multimodal Pre-training Embeddings for Emotion Cause Prediction in Conversations
Shu Li | Zicen Liao | Huizhi Liang

In this study, we introduce an MLP approach for extracting multimodal cause utterances in conversations, utilizing the multimodal conversational emotion causes from the ECF dataset. Our research focuses on evaluating a bi-modal framework that integrates video and audio embeddings to analyze emotional expressions within dialogues. The core of our methodology involves the extraction of embeddings from pre-trained models for each modality, followed by their concatenation and subsequent classification via an MLP network. We compared the accuracy performances across different modality combinations including text-audio-video, video-audio, and audio only.

pdf bib abs
DeBERTa at SemEval-2024 Task 9: Using DeBERTa for Defying Common Sense
Marco Siino

The widespread success of language models has spurred the natural language processing (NLP) community to tackle tasks demanding implicit and intricate reasoning, drawing upon human-like common-sense mechanisms. While endeavors in vertical thinking tasks have garnered considerable attention, there has been a relative dearth of exploration in lateral thinking puzzles. To address this gap, we introduce BRAINTEASER: a multiple-choice Question Answering task meticulously crafted to evaluate the model’s capacity for lateral thinking and its ability to challenge default common-sense associations. At the SemEval-2024 Task 9, for the first subtask (i.e., Sentence Puzzle) the organizers asked the participants to develop models able to reply to multi-answer brain-teasing questions. For this purpose, we propose the application of a DeBERTa model in a zero-shot configuration. Our proposed approach is able to reach an overall score of 0.250. Suggesting a significant room for improvements in future works.

pdf bib abs
TransMistral at SemEval-2024 Task 10: Using Mistral 7B for Emotion Discovery and Reasoning its Flip in Conversation
Marco Siino

The EDiReF shared task at SemEval 2024 comprises three subtasks: Emotion Recognition in Conversation (ERC) in Hindi-English code-mixed conversations, Emotion Flip Reasoning (EFR) in Hindi-English code-mixed conversations, and EFR in English conversations. The objectives for the ERC and EFR tasks are defined as follows: 1) Emotion Recognition in Conversation (ERC): In this task, participants are tasked with assigning an emotion to each utterance within a dialogue from a predefined set of possible emotions. The goal is to accurately recognize and label the emotions expressed in the conversation; 2) Emotion Flip Reasoning (EFR): This task involves identifying the trigger utterance(s) for an emotion-flip within a multi-party conversation dialogue. Participants are required to pinpoint the specific utterance(s) that serve as catalysts for a change in emotion during the conversation. In this paper we only address the first subtask (ERC) making use of an online translation strategy followed by the application of a Mistral 7B model together with a few-shot prompt strategy. Our approach obtains an F1 of 0.36, eventually exhibiting further room for improvements.

pdf bib abs
0x.Yuan at SemEval-2024 Task 2: Agents Debating can reach consensus and produce better outcomes in Medical NLI task
Yu-an Lu | Hung-yu Kao

In this paper, we introduce a multi-agent debating framework, experimenting on SemEval 2024 Task 2. This innovative system employs a collaborative approach involving expert agents from various medical fields to analyze Clinical Trial Reports (CTRs). Our methodology emphasizes nuanced and comprehensive analysis by leveraging the diverse expertise of agents like Biostatisticians and Medical Linguists. Results indicate that our collaborative model surpasses the performance of individual agents in terms of Macro F1-score. Additionally, our analysis suggests that while initial debates often mirror majority decisions, the debating process refines these outcomes, demonstrating the system’s capability for in-depth analysis beyond simple majority rule. This research highlights the potential of AI collaboration in specialized domains, particularly in medical text interpretation.

pdf bib abs
TW-NLP at SemEval-2024 Task10: Emotion Recognition and Emotion Reversal Inference in Multi-Party Dialogues.
Wei Tian | Peiyu Ji | Lei Zhang | Yue Jian

In multidimensional dialogues, emotions serve not only as crucial mediators of emotional exchanges but also carry rich information. Therefore, accurately identifying the emotions of interlocutors and understanding the triggering factors of emotional changes are paramount. This study focuses on the tasks of multilingual dialogue emotion recognition and emotion reversal reasoning based on provocateurs, aiming to enhance the accuracy and depth of emotional understanding in dialogues. To achieve this goal, we propose a novel model, MBERT-TextRCNN-PL, designed to effectively capture emotional information of interlocutors. Additionally, we introduce XGBoost-EC (Emotion Capturer) to identify emotion provocateurs, thereby delving deeper into the causal relationships behind emotional changes. By comparing with state-of-the-art models, our approach demonstrates significant improvements in recognizing dialogue emotions and provocateurs, offering new insights and methodologies for multilingual dialogue emotion understanding and emotion reversal research.

In this paper, we present an approach for solving SemEval-2024 Task 3: The Competition of Multimodal Emotion Cause Analysis in Conversations. The task includes two subtasks that focus on emotion-cause pair extraction using text, video, and audio modalities. Our approach is composed of encoding all modalities (MFCC and Wav2Vec for audio, 3D-CNN for video, and transformer-based models for text) and combining them in an utterance-level fusion module. The model is then optimized for link and emotion prediction simultaneously. Our approach achieved 6th place in both subtasks. The full leaderboard can be found at https://codalab.lisn.upsaclay.fr/competitions/16141#results

pdf bib abs
GAVx at SemEval-2024 Task 10: Emotion Flip Reasoning via Stacked Instruction Finetuning of LLMs
Vy Nguyen | Xiuzhen Zhang

The Emotion Flip Reasoning task at SemEval 2024 aims at identifying the utterance(s) that trigger a speaker to shift from an emotion to another in a multi-party conversation. The spontaneous, informal, and occasionally multilingual dynamics of conversations make the task challenging. In this paper, we propose a supervised stacked instruction-based framework to finetune large language models to tackle this task. Utilising the annotated datasets provided, we curate multiple instruction sets involving chain-of-thoughts, feedback, and self-evaluation instructions, for a multi-step finetuning pipeline. We utilise the self-consistency inference strategy to enhance prediction consistency. Experimental results reveal commendable performance, achieving mean F1 scores of 0.77 and 0.76 for triggers in the Hindi-English and English-only tracks respectively. This led to us earning the second highest ranking in both tracks.

pdf bib abs
NLP_STR_teamS at SemEval-2024 Task1: Semantic Textual Relatedness based on MASK Prediction and BERT Model
Lianshuang Su | Xiaobing Zhou

This paper describes our participation in the SemEval-2024 Task 1, “Semantic Textual Relatedness for African and Asian Languages.” This task detects the degree of semantic relatedness between pairs of sentences. Our approach is to take out the sentence pairs of each instance to construct a new sentence as the prompt template, use MASK to predict the correlation between the two sentences, use the BERT pre-training model to process and calculate the text sequence, and use the synonym replacement method in text data augmentation to expand the size of the data set. We participate in English in track A, which uses a supervised approach, and the Spearman Correlation on the test set is 0.809.

pdf bib abs
Halu-NLP at SemEval-2024 Task 6: MetaCheckGPT - A Multi-task Hallucination Detection using LLM uncertainty and meta-models
Rahul Mehta | Andrew Hoblitzell | Jack O’keefe | Hyeju Jang | Vasudeva Varma

Hallucinations in large language models(LLMs) have recently become a significantproblem. A recent effort in this directionis a shared task at Semeval 2024 Task 6,SHROOM, a Shared-task on Hallucinationsand Related Observable Overgeneration Mis-takes. This paper describes our winning so-lution ranked 1st and 2nd in the 2 sub-tasksof model agnostic and model aware tracks re-spectively. We propose a meta-regressor basedensemble of LLMs based on a random forestalgorithm that achieves the highest scores onthe leader board. We also experiment with var-ious transformer based models and black boxmethods like ChatGPT, Vectara, and others. Inaddition, we perform an error analysis com-paring ChatGPT against our best model whichshows the limitations of the former

pdf bib abs
QFNU_CS at SemEval-2024 Task 3: A Hybrid Pre-trained Model based Approach for Multimodal Emotion-Cause Pair Extraction Task
Zining Wang | Yanchao Zhao | Guanghui Han | Yang Song

This article presents the solution of Qufu Normal University for the Multimodal Sentiment Cause Analysis competition in SemEval2024 Task 3.The competition aims to extract emotion-cause pairs from dialogues containing text, audio, and video modalities. To cope with this task, we employ a hybrid pre-train model based approach. Specifically, we first extract and fusion features from dialogues based on BERT, BiLSTM, openSMILE and C3D. Then, we adopt BiLSTM and Transformer to extract the candidate emotion-cause pairs. Finally, we design a filter to identify the correct emotion-cause pairs. The evaluation results show that, we achieve a weighted average F1 score of 0.1786 and an F1 score of 0.1882 on CodaLab.

pdf bib abs
NewbieML at SemEval-2024 Task 8: Ensemble Approach for Multidomain Machine-Generated Text Detection
Bao Tran | Nhi Tran

Large Language Models (LLMs) are becoming popular and easily accessible, leading to a large growth of machine-generated content over various channels. Along with this popularity, the potential misuse is also a challenge for us. In this paper, we use SemEval 2024 task A monolingual dataset with comparative study between some machine learning model with feature extraction and develop an ensemble method for our system. Our system achieved 84.31% accuracy score in the test set, ranked 36th of 137 participants. Our code is available at: https://github.com/baoivy/SemEval-Task8

pdf bib abs
Hidetsune at SemEval-2024 Task 3: A Simple Textual Approach to Emotion Classification and Emotion Cause Analysis in Conversations Using Machine Learning and Next Sentence Prediction
Hidetsune Takahashi

In this system paper for SemEval-2024 Task3 subtask 2, I present my simple textual approach to emotion classification and emotioncause analysis in conversations using machinelearning and next sentence prediction. I train aSpaCy model for emotion classification and usenext sentence prediction with BERT for emotion cause analysis. While speaker names andaudio-visual clips are given in addition to textof the conversations, my approach uses textualdata only to test my methodology to combinemachine learning with next sentence prediction.This paper reveals both strengths and weaknesses of my trial, suggesting a direction offuture studies to improve my introductory solution.

pdf bib abs
CLTeam1 at SemEval-2024 Task 10: Large Language Model based ensemble for Emotion Detection in Hinglish
Ankit Vaidya | Aditya Gokhale | Arnav Desai | Ishaan Shukla | Sheetal Sonawane

This paper outlines our approach for the ERC subtask of the SemEval 2024 EdiREF Shared Task. In this sub-task, an emotion had to be assigned to an utterance which was the part of a dialogue. The utterance had to be classified into one of the following classes- disgust, contempt, anger, neutral, joy, sadness, fear, surprise. Our proposed system makes use of an ensemble of language specific RoBERTA and BERT models to tackle the problem. A weighted F1-score of 44% was achieved by our system in this task. We conducted comprehensive ablations and suggested directions of future work. Our codebase is available publicly.

pdf bib abs
Hidetsune at SemEval-2024 Task 4: An Application of Machine Learning to Multilingual Propagandistic Memes Identification Using Machine Translation
Hidetsune Takahashi

In this system paper for SemEval-2024 Task4 subtask 2b, I present my approach to identifying propagandistic memes in multiple languages. I firstly establish a baseline for Englishand then implement the model into other languages (Bulgarian, North Macedonian and Arabic) by using machine translation. Data fromother subtasks (subtask 1, subtask 2a) are alsoused in addition to data for this subtask, andadditional data from Kaggle are concatenatedto these in order to enhance the model. Theresults show a high reliability of my Englishbaseline and a room for improvement of itsimplementation.

pdf bib abs
Hidetsune at SemEval-2024 Task 10: An English Based Approach to Emotion Recognition in Hindi-English code-mixed Conversations Using Machine Learning and Machine Translation
Hidetsune Takahashi

In this system paper for SemEval-2024 Task10 subtask 1 (ERC), I present my approach torecognizing emotions in Hindi-English codemixed conversations. I train a SpaCy modelwith English translated data and classify emotions behind Hindi-English code-mixed utterances by using the model and translating theminto English. I use machine translation to translate all the data in Hindi-English mixed language into English due to an easy access to existing data for emotion recognition in English.Some additional data in English are used to enhance my model. This English based approachdemonstrates a fundamental possibility and potential of simplifying code-mixed language intoone major language for emotion recognition.

pdf bib abs
All-Mpnet at SemEval-2024 Task 1: Application of Mpnet for Evaluating Semantic Textual Relatedness
Marco Siino

In this study, we tackle the task of automatically discerning the level of semantic relatedness between pairs of sentences. Specifically, Task 1 at SemEval-2024 involves predicting the Semantic Textual Relatedness (STR) of sentence pairs. Participants are tasked with ranking sentence pairs based on their proximity in meaning, quantified by their degree of semantic relatedness, across 14 different languages. Each sentence pair is assigned manually determined relatedness scores ranging from 0 (indicating complete lack of relation) to 1 (denoting maximum relatedness). In our submitted approach on the official test set, focusing on Task 1 (a supervised task in English and Spanish), we achieve a Spearman rank correlation coefficient of 0.808 for the English language and 0.611 for the Spanish language.

pdf bib abs
0x.Yuan at SemEval-2024 Task 5: Enhancing Legal Argument Reasoning with Structured Prompts
Yu-an Lu | Hung-yu Kao

The intersection of legal reasoning and Natural Language Processing (NLP) technologies, particularly Large Language Models (LLMs), offers groundbreaking potential for augmenting human capabilities in the legal domain. This paper presents our approach and findings from participating in SemEval-2024 Task 5, focusing on the effect of argument reasoning in civil procedures using legal reasoning prompts. We investigated the impact of structured legal reasoning methodologies, including TREACC, IRAC, IRAAC, and MIRAC, on guiding LLMs to analyze and evaluate legal arguments systematically. Our experimental setup involved crafting specific prompts based on these methodologies to instruct the LLM to dissect and scrutinize legal cases, aiming to discern the cogency of argumentative solutions within a zero-shot learning framework. The performance of our approach, as measured by F1 score and accuracy, demonstrated the efficacy of integrating structured legal reasoning into LLMs for legal analysis. The findings underscore the promise of LLMs, when equipped with legal reasoning prompts, in enhancing their ability to process and reason through complex legal texts, thus contributing to the broader application of AI in legal studies and practice.

pdf bib abs
Groningen team D at SemEval-2024 Task 8: Exploring data generation and a combined model for fine-tuning LLMs for Multidomain Machine-Generated Text Detection
Thijs Brekhof | Xuanyi Liu | Joris Ruitenbeek | Niels Top | Yuwen Zhou

In this system description, we describe our process and the systems that we created for the subtasks A monolingual, A multilingual, and B forthe SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box MachineGenerated Text Detection. This shared task aimsat detecting and differentiating between machinegenerated text and human-written text. SubtaskA is focused on detecting if a text is machinegenerated or human-written both in a monolingualand a multilingual setting. Subtask B is also focused on detecting if a text is human-written ormachine-generated, though it takes it one step further by also requiring the detection of the correct language model used for generating the text.For the monolingual aspects of this task, our approach is centered around fine-tuning a debertav3-large LM. For the multilingual setting, we created an ensemble model utilizing different monolingual models and a language identification toolto classify each text. We also experiment with thegeneration of extra training data. Our results showthat the generation of extra data aids our modelsand leads to an increase in accuracy.

pdf bib abs
Kathlalu at SemEval-2024 Task 8: A Comparative Analysis of Binary Classification Methods for Distinguishing Between Human and Machine-generated Text
Lujia Cao | Ece Lara Kilic | Katharina Will

This paper investigates two methods for constructing a binary classifier to distinguish between human-generated and machine-generated text. The main emphasis is on a straightforward approach based on Zipf’s law, which, despite its simplicity, achieves a moderate level of performance. Additionally, the paper briefly discusses experimentation with the utilization of unigram word counts.

pdf bib abs
Team Unibuc - NLP at SemEval-2024 Task 8: Transformer and Hybrid Deep Learning Based Models for Machine-Generated Text Detection
Teodor-george Marchitan | Claudiu Creanga | Liviu P. Dinu

This paper describes the approach of the UniBuc - NLP team in tackling the SemEval 2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection. We explored transformer-based and hybrid deep learning architectures. For subtask B, our transformer-based model achieved a strong second-place out of 77 teams with an accuracy of 86.95%, demonstrating the architecture’s suitability for this task. However, our models showed overfitting in subtask A which could potentially be fixed with less fine-tunning and increasing maximum sequence length. For subtask C (token-level classification), our hybrid model overfit during training, hindering its ability to detect transitions between human and machine-generated text.

The “Emotion Discovery and Reasoning Its Flip in Conversation” task at the SemEval 2024 competition focuses on the automatic recognition of emotion flips, triggered within multi-party textual conversations. This paper proposes a novel approach that draws a parallel between a mixed strategy and a comparative strategy, contrasting a Rule-Based Function with Named Entity Recognition (NER)—an approach that shows promise in understanding speaker-specific emotional dynamics. Furthermore, this method surpasses the performance of both DistilBERT and RoBERTa models, demonstrating competitive effectiveness in detecting emotion flips triggered in multi-party textual conversations, achieving a 70% F1-score. This system was ranked 6th in the SemEval 2024 competition for Subtask 3.

pdf bib abs
Text Mining at SemEval-2024 Task 1: Evaluating Semantic Textual Relatedness in Low-resource Languages using Various Embedding Methods and Machine Learning Regression Models
Ron Keinan

In this paper, I describe my submission to the SemEval-2024 contest. I tackled subtask 1 - “Semantic Textual Relatedness for African and Asian Languages”. To find the semantic relatedness of sentence pairs, I tackled this task by creating models for nine different languages. I then vectorized the text data using a variety of embedding techniques including doc2vec, tf-idf, Sentence-Transformers, Bert, Roberta, and more, and used 11 traditional machine learning techniques of the regression type for analysis and evaluation.

pdf bib abs
USMBA-NLP at SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials using Bert
Anass Fahfouh | Abdessamad Benlahbib | Jamal Riffi | Hamid Tairi

This paper presents the application of BERT inSemEval 2024 Task 2, Safe Biomedical Natu-ral Language Inference for Clinical Trials. Themain objectives of this task were: First, to in-vestigate the consistency of BERT in its rep-resentation of semantic phenomena necessaryfor complex inference in clinical NLI settings.Second, to investigate the ability of BERT toperform faithful reasoning, i.e., make correctpredictions for the correct reasons. The submit-ted model is fine-tuned on the NLI4CT dataset,which is enhanced with a novel contrast set,using binary cross entropy loss.

pdf bib abs
CRCL at SemEval-2024 Task 2: Simple prompt optimizations
Clement Brutti-mairesse | Loic Verlingue

We present a baseline for the SemEval 2024 task 2 challenge, whose objective is to ascertain the inference relationship between pairs of clinical trial report sections and statements.We apply prompt optimization techniques with LLM Instruct models provided as a Language Model-as-a-Service (LMaaS).We observed, in line with recent findings, that synthetic CoT prompts significantly enhance manually crafted ones.The source code is available at this GitHub repository https://github.com/ClementBM-CLB/semeval-2024

pdf bib abs
SuteAlbastre at SemEval-2024 Task 4: Predicting Propaganda Techniques in Multilingual Memes using Joint Text and Vision Transformers
Ion Anghelina | Gabriel Buță | Alexandru Enache

The main goal of this year’s SemEval Task 4 isdetecting the presence of persuasion techniquesin various meme formats. While Subtask 1targets text-only posts, Subtask 2, subsectionsa and b tackle posts containing both imagesand captions. The first 2 subtasks consist ofmulti-class and multi-label classifications, inthe context of a hierarchical taxonomy of 22different persuasion techniques.This paper proposes a solution for persuasiondetection in both these scenarios and for vari-ous languages of the caption text. Our team’smain approach consists of a Multimodal Learn-ing Neural Network architecture, having Tex-tual and Vision Transformers as its backbone.The models that we have experimented with in-clude EfficientNet and ViT as visual encodersand BERT and GPT2 as textual encoders.

pdf bib abs
RFBES at SemEval-2024 Task 8: Investigating Syntactic and Semantic Features for Distinguishing AI-Generated and Human-Written Texts
Mohammad Heydari Rad | Farhan Farsi | Shayan Bali | Romina Etezadi | Mehrnoush Shamsfard

Nowadays, the usage of Large Language Models (LLMs) has increased, and LLMs have been used to generate texts in different languages and for different tasks. Additionally, due to the participation of remarkable companies such as Google and OpenAI, LLMs are now more accessible, and people can easily use them. However, an important issue is how we can detect AI-generated texts from human-written ones. In this article, we have investigated the problem of AI-generated text detection from two different aspects: semantics and syntax. Finally, we presented an AI model that can distinguish AI-generated texts from human-written ones with high accuracy on both multilingual and monolingual tasks using the M4 dataset. According to our results, using a semantic approach would be more helpful for detection. However, there is a lot of room for improvement in the syntactic approach, and it would be a good approach for future work.

This paper describes the BAMBAS team’s participation in SemEval-2024 Task 4 Subtask 1, which focused on the multilabel classification of persuasion techniques in the textual content of Internet memes. We explored a lightweight approach that does not consider the hierarchy of labels. First, we get the text embeddings leveraging the multilingual tweets-based language model, Bernice. Next, we use those embeddings to train a separate binary classifier for each label, adopting independent oversampling strategies in each model in a binary-relevance style. We tested our approach over the English dataset, exceeding the baseline by 21 percentage points, while ranking in 23th in terms of hierarchical F1 and 11st in terms of hierarchical recall.

pdf bib abs
Team QUST at SemEval-2024 Task 8: A Comprehensive Study of Monolingual and Multilingual Approaches for Detecting AI-generated Text
Xiaoman Xu | Xiangrun Li | Taihang Wang | Jianxiang Tian | Ye Jiang

This paper presents the participation of team QUST in Task 8 SemEval 2024. we first performed data augmentation and cleaning on the dataset to enhance model training efficiency and accuracy. In the monolingual task, we evaluated traditional deep-learning methods, multiscale positive-unlabeled framework (MPU), fine-tuning, adapters and ensemble methods. Then, we selected the top-performing models based on their accuracy from the monolingual models and evaluated them in subtasks A and B. The final model construction employed a stacking ensemble that combined fine-tuning with MPU. Our system achieved 6th (scored 6th in terms of accuracy, officially ranked 13th in order) place in the official test set in multilingual settings of subtask A. We release our system code at:https://github.com/warmth27/SemEval2024_QUST

pdf bib abs
YNU-HPCC at SemEval-2024 Task 9: Using Pre-trained Language Models with LoRA for Multiple-choice Answering Tasks
Jie Wang | Jin Wang | Xuejie Zhang

This study describes the model built in Task 9: brainteaser in the SemEval-2024 competition, which is a multiple-choice task. As active participants in Task 9, our system strategically employs the decoding-enhanced BERT (DeBERTa) architecture enriched with disentangled attention mechanisms. Additionally, we fine-tuned our model using low-rank adaptation (LoRA) to optimize its performance further. Moreover, we integrate focal loss into our framework to address label imbalance issues. The systematic integration of these techniques has resulted in outstanding performance metrics. Upon evaluation using the provided test dataset, our system showcases commendable results, with a remarkable accuracy score of 0.9 for subtask 1, positioning us fifth among all participants. Similarly, for subtask 2, our system exhibits a substantial accuracy rate of 0.781, securing a commendable seventh-place ranking. The code for this paper is published at: https://github.com/123yunnandaxue/Semveal-2024_task9.

pdf bib abs
Team jelarson at SemEval 2024 Task 8: Predicting Boundary Line Between Human and Machine Generated Text
Joseph Larson | Francis Tyers

In this paper, we handle the task of building a system that, given a document written first by a human and then finished by an LLM, the system must determine the transition word i.e. where the machine begins to write. We built a system by examining the data for textual anomalies and combining a method of heuristic approaches with a linear regression model based on the text length of each document.

pdf bib abs
HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text?
Shubhashis Roy Dipta | Sadat Shahriar

This paper describes our system developed for SemEval-2024 Task 8, “Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection.” Machine-generated texts have been one of the main concerns due to the use of large language models (LLM) in fake text generation, phishing, cheating in exams, or even plagiarizing copyright materials. A lot of systems have been developed to detect machine-generated text. Nonetheless, the majority of these systems rely on the text-generating model. This limitation is impractical in real-world scenarios, as it’s often impossible to know which specific model the user has used for text generation. In this work, we propose a single model based on contrastive learning, which uses ~40% of the baseline’s parameters (149M vs. 355M) but shows a comparable performance on the test dataset (21st out of 137 participants). Our key finding is that even without an ensemble of multiple models, a single base model can have comparable performance with the help of data augmentation and contrastive learning.

pdf bib abs
Team AT at SemEval-2024 Task 8: Machine-Generated Text Detection with Semantic Embeddings
Yuchen Wei

This study investigates the detection of machine-generated text using several semantic embedding techniques, a critical issue in the era of advanced language models. Different methodologies were examined: GloVe embeddings, N-gram embedding models, Sentence BERT, and a concatenated embedding approach, against a fine-tuned RoBERTa baseline. The research was conducted within the framework of SemEval-2024 Task 8, encompassing tasks for binary and multi-class classification of machine-generated text.

pdf bib abs
JN666 at SemEval-2024 Task 7: NumEval: Numeral-Aware Language Understanding and Generation
Xinyi Liu | Xintong Liu | Hengyang Lu

This paper is submitted for SemEval-2027 task 7: Enhancing the Model’s Understanding and Generation of Numerical Values. The dataset for this task is NQuAD, which requires us to select the most suitable option number from four numerical options to fill in the blank in a news article based on the context. Based on the BertForMultipleChoice model, we proposed two new models, MC BERT and SSC BERT, and improved the model’s numerical understanding ability by pre-training the model on numerical comparison tasks. Ultimately, our best-performing model achieved an accuracy rate of 79.40%, which is 9.45% higher than the accuracy rate of NEMo.

pdf bib abs
BERTastic at SemEval-2024 Task 4: State-of-the-Art Multilingual Propaganda Detection in Memes via Zero-Shot Learning with Vision-Language Models
Tarek Mahmoud | Preslav Nakov

Analyzing propagandistic memes in a multilingual, multimodal dataset is a challenging problem due to the inherent complexity of memes’ multimodal content, which combines images, text, and often, nuanced context. In this paper, we use a VLM in a zero-shot approach to detect propagandistic memes and achieve a state-of-the-art average macro F1 of 66.7% over all languages. Notably, we outperform other systems on North Macedonian memes, and obtain competitive results on Bulgarian and Arabic memes. We also present our early fusion approach for identifying persuasion techniques in memes in a hierarchical multilabel classification setting. This approach outperforms all other approaches in average hierarchical precision with an average score of 77.66%. The systems presented contribute to the evolving field of research on the detection of persuasion techniques in multimodal datasets by offering insights that could be of use in the development of more effective tools for combating online propaganda.

pdf bib abs
RKadiyala at SemEval-2024 Task 8: Black-Box Word-Level Text Boundary Detection in Partially Machine Generated Texts
Ram Mohan Rao Kadiyala

With increasing usage of generative models for text generation and widespread use of machine generated texts in various domains, being able to distinguish between human written and machine generated texts is a significant challenge. While existing models and proprietary systems focus on identifying whether given text is entirely human written or entirely machine generated, only a few systems provide insights at sentence or paragraph level at likelihood of being machine generated at a non reliable accuracy level, working well only for a set of domains and generators. This paper introduces few reliable approaches for the novel task of identifying which part of a given text is machine generated at a word level while comparing results from different approaches and methods. We present a comparison with proprietary systems , performance of our model on unseen domains’ and generators’ texts. The findings reveal significant improvements in detection accuracy along with comparison on other aspects of detection capabilities. Finally we discuss potential avenues for improvement and implications of our work. The proposed model is also well suited for detecting which parts of a text are machine generated in outputs of Instruct variants of many LLMs.

pdf bib abs
TLDR at SemEval-2024 Task 2: T5-generated clinical-Language summaries for DeBERTa Report Analysis
Spandan Das | Vinay Samuel | Shahriar Noroozizadeh

This paper introduces novel methodologies for the Natural Language Inference for Clinical Trials (NLI4CT) task. We present TLDR (T5-generated clinical-Language summaries for DeBERTa Report Analysis) which incorporates T5-model generated premise summaries for improved entailment and contradiction analysis in clinical NLI tasks. This approach overcomes the challenges posed by small context windows and lengthy premises, leading to a substantial improvement in Macro F1 scores: a 0.184 increase over truncated premises. Our comprehensive experimental evaluation, including detailed error analysis and ablations, confirms the superiority of TLDR in achieving consistency and faithfulness in predictions against semantically altered inputs.

pdf bib abs
ignore at SemEval-2024 Task 5: A Legal Classification Model with Summary Generation and Contrastive Learning
Binjie Sun | Xiaobing Zhou

This paper describes our work for SemEval-2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure. After analyzing the task requirements and the training dataset, we used data augmentation, adopted the large model GPT for summary generation, and added supervised contrastive learning to the basic BERT model. Our system achieved an F1 score of 0.551, ranking 14th in the competition leaderboard. Our system achieves an F1 score improvement of 0.1241 over the official baseline model.

In human-computer interaction, it is crucial for agents to respond to human by understanding their emotions. unraveling the causes of emotions is more challenging. A new task named Multimodal Emotion-Cause Pair Extraction in Conversations is responsible for recognizing emotion and identifying causal expressions. In this study, we propose a multi-stage framework to generate emotion and extract the emotion causal pairs given the target emotion. In the first stage, LLaMA2-based InstructERC is utilized to extract the emotion category of each utterance in a conversation. After emotion recognition, a two-stream attention model is employed to extract the emotion causal pairs given the target emotion for subtask 2 while MuTEC is employed to extract causal span for subtask 1. Our approach achieved first place for both of the two subtasks in the competition.

pdf bib abs
Werkzeug at SemEval-2024 Task 8: LLM-Generated Text Detection via Gated Mixture-of-Experts Fine-Tuning
Youlin Wu | Kaichun Wang | Kai Ma | Liang Yang | Hongfei Lin

Recent advancements in Large Language Models (LLMs) have propelled text generation to unprecedented heights, approaching human-level quality. However, it poses a new challenge to distinguish LLM-generated text from human-written text. Presently, most methods address this issue through classification, achieved by fine-tuning on small language models. Unfortunately, small language models suffer from anisotropy issue, where encoded text embeddings become difficult to differentiate in the latent space. Moreover, LLMs possess the ability to alter language styles with versatility, further complicating the classification task. To tackle these challenges, we propose Gated Mixture-of-Experts Fine-tuning (GMoEF) to detect LLM-generated text. GMoEF leverages parametric whitening to normalize text embeddings, thereby mitigating the anisotropy problem. Additionally, GMoEF employs the mixture-of-experts framework equipped with gating router to capture features of LLM-generated text from multiple perspectives. Our GMoEF achieved an impressive ranking of #8 out of 70 teams. The source code is available on https://gitlab.com/sigrs/gmoef.

pdf bib abs
SSN_Semeval10 at SemEval-2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversations
Antony Rajesh | Supriya Abirami | Aravindan Chandrabose | Senthil Kumar

This paper presents a transformer-based model for recognizing emotions in Hindi-English code-mixed conversations, adhering to the SemEval task constraints. Leveraging BERT-based transformers, we fine-tune pre-trained models on the dataset, incorporating tokenization and attention mechanisms. Our approach achieves competitive performance (weighted F1-score of 0.4), showcasing the effectiveness of BERT in nuanced emotion analysis tasks within code-mixed conversational contexts.

pdf bib abs
KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection
Michal Spiegel | Dominik Macko

SemEval-2024 Task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection. Such a detection is important for preventing a potential misuse of large language models (LLMs), the newest of which are very capable in generating multilingual human-like texts. We have coped with this task in multiple ways, utilizing language identification and parameter-efficient fine-tuning of smaller LLMs for text classification. We have further used the per-language classification-threshold calibration to uniquely combine fine-tuned models predictions with statistical detection metrics to improve generalization of the system detection performance. Our submitted method achieved competitive results, ranking at the fourth place, just under 1 percentage point behind the winner.

In this paper, we delve into the realm of detecting machine-generated text (MGT) within Natural Language Processing (NLP). Our approach involves fine-tuning a RoBERTa-base Transformer, a robust neural architecture, to tackle MGT detection as a binary classification task. Specifically focusing on Subtask A (Monolingual - English) within the SemEval-2024 competition framework, our system achieves a 78.9% accuracy on the test dataset, placing us 57th among participants. While our system demonstrates proficiency in identifying human-written texts, it faces challenges in accurately discerning MGTs.

pdf bib abs
IRIT-Berger-Levrault at SemEval-2024: How Sensitive Sentence Embeddings are to Hallucinations?
Nihed Bendahman | Karen Pinel-sauvagnat | Gilles Hubert | Mokhtar Billami

This article presents our participation to Task 6 of SemEval-2024, named SHROOM (a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes), which aims at detecting hallucinations. We propose two types of approaches for the task: the first one is based on sentence embeddings and cosine similarity metric, and the second one uses LLMs (Large Language Model). We found that LLMs fail to improve the performance achieved by embedding generation models. The latter outperform the baseline provided by the organizers, and our best system achieves 78% accuracy.

pdf bib abs
CYUT at SemEval-2024 Task 7: A Numerals Augmentation and Feature Enhancement Approach to Numeral Reading Comprehension
Tsz-yeung Lau | Shih-hung Wu

This study explores Task 2 in NumEval-2024, which is SemEval-2024(Semantic Evaluation)Task 7 , focusing on the Reading Comprehension of Numerals in Text (Chinese). The datasetutilized in this study is the Numeral-related Question Answering Dataset (NQuAD), and the model employed is BERT. The data undergoes preprocessing, incorporating Numerals Augmentation and Feature Enhancement to numerical entities before model training. Additionally, fine-tuning will also be applied. The result was an accuracy rate of 77.09%, representing a 7.14% improvement compared to the initial NQuAD processing model, referred to as the Numeracy-Enhanced Model (NEMo).

pdf bib abs
UniBuc at SemEval-2024 Task 2: Tailored Prompting with Solar for Clinical NLI
Marius Micluta-Campeanu | Claudiu Creanga | Ana-maria Bucur | Ana Sabina Uban | Liviu P. Dinu

This paper describes the approach of the UniBuc team in tackling the SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We used SOLAR Instruct, without any fine-tuning, while focusing on input manipulation and tailored prompting. By customizing prompts for individual CTR sections, in both zero-shot and few-shots settings, we managed to achieve a consistency score of 0.72, ranking 14th in the leaderboard. Our thorough error analysis revealed that our model has a tendency to take shortcuts and rely on simple heuristics, especially when dealing with semantic-preserving changes.

pdf bib abs
Fralak at SemEval-2024 Task 4: combining RNN-generated hierarchy paths with simple neural nets for hierarchical multilabel text classification in a multilingual zero-shot setting
Katarina Laken

This paper describes the submission of team fralak for subtask 1 of task 4 of the Semeval-2024 shared task: ‘Multilingual detection of persuasion techniques in memes’. The first subtask included only the textual content of the memes. We restructured the labels into strings that showed the full path through the hierarchy. The system includes an RNN module that is trained to generate these strings. This module was then incorporated in an ensemble model with 2 more models consisting of basic fully connected networks. Although our model did not perform particularly well on the English only setting, we found that it generalized better to other languages in a zero-shot context than most other models. Some additional experiments were performed to explain this. Findings suggest that the RNN generating the restructured labels generalized well across languages, but preprocessing did not seem to play a role. We conclude by giving suggestions for future improvements of our core idea.

For our submission for Subtask 1, we developed a custom classification head that is designed to be applied atop of a Large Language Model. We reconstructed the hierarchy across multiple fully connected layers, allowing us to incorporate previous foundational decisions in subsequent, more fine-grained layers. To find the best hyperparameters, we conducted a grid-search and to compete in the multilingual setting, we translated all documents to English.

pdf bib abs
D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models
Duygu Altinok

Large language models (LLMs) have garnered significant attention and widespread usage due to their impressive performance in various tasks. However, they are not without their own set of challenges, including issues such as hallucinations, factual inconsistencies, and limitations in numerical-quantitative reasoning. Evaluating LLMs in miscellaneous reasoning tasks remains an active area of research. Prior to the breakthrough of LLMs, Transformers had already proven successful in the medical domain, effectively employed for various natural language understanding (NLU) tasks. Following this trend, LLMs have also been trained and utilized in the medical domain, raising concerns regarding factual accuracy, adherence tosafety protocols, and inherent limitations. In this paper, we focus on evaluating the natural language inference capabilities of popular open-source and closed-source LLMs using clinical trial reports as the dataset. We present the performance results of each LLM and further analyze their performance on a development set, particularly focusing on challenging instances that involve medical abbreviations and require numerical-quantitative reasoning. Gemini, our leading LLM, achieved a test set F1-score of 0.748, securing the ninth position on the task scoreboard. Our work is the first of its kind, offering a thorough examination of the inference capabilities of LLMs within the medical domain.

pdf bib abs
LMEME at SemEval-2024 Task 4: Teacher Student Fusion - Integrating CLIP with LLMs for Enhanced Persuasion Detection
Shiyi Li | Yike Wang | Liang Yang | Shaowu Zhang | Hongfei Lin

This paper describes our system used in the SemEval-2024 Task 4 Multilingual Detection of Persuasion Techniques in Memes. Our team proposes a detection system that employs a Teacher Student Fusion framework. Initially, a Large Language Model serves as the teacher, engaging in abductive reasoning on multimodal inputs to generate background knowledge on persuasion techniques, assisting in the training of a smaller downstream model. The student model adopts CLIP as an encoder for text and image features, and we incorporate an attention mechanism for modality alignment. Ultimately, our proposed system achieves a Macro-F1 score of 0.8103, ranking 1st out of 20 on the leaderboard of Subtask 2b in English. In Bulgarian, Macedonian and Arabic, our detection capabilities are ranked 1/15, 3/15 and 14/15.

pdf bib abs
Innovators at SemEval-2024 Task 10: Revolutionizing Emotion Recognition and Flip Analysis in Code-Mixed Texts
Abhay Shanbhag | Suramya Jadhav | Shashank Rathi | Siddhesh Pande | Dipali Kadam

In this paper, we introduce our system for all three tracks of the SemEval 2024 EDiReF Shared Task 10, which focuses on Emotion Recognition in Conversation (ERC) and Emotion Flip Reasoning (EFR) within the domain of conversational analysis. Task-Track 1 (ERC) aims to assign an emotion to each utterance in the Hinglish language from a predefined set of possible emotions. Tracks 2 (EFR) and 3 (EFR) aim to identify the trigger utterance(s) for an emotion flip in a multi-party conversation dialogue in Hinglish and English text, respectively. For Track 1, our study spans both traditional machine learning ensemble techniques, including Decision Trees, SVM, Logistic Regression, and Multinomial NB models, as well as advanced transformer-based models like XLM-Roberta (XLMR), DistilRoberta, and T5 from Hugging Face’s transformer library. In the EFR competition, we developed and proposed two innovative algorithms to tackle the challenges presented in Tracks 2 and 3. Specifically, our team, Innovators, developed a standout algorithm that propelled us to secure the 2nd rank in Track 2, achieving an impressive F1 score of 0.79, and the 7th rank in Track 3, with an F1 score of 0.68.

The development of social platforms has facilitated the proliferation of disinformation, with memes becoming one of the most popular types of propaganda for disseminating disinformation on the internet. Effectively detecting the persuasion techniques hidden within memes is helpful in understanding user-generated content and further promoting the detection of disinformation on the internet. This paper demonstrates the approach proposed by Team DUTIR938 in Subtask 2b of SemEval-2024 Task 4. We propose a dual-channel model based on semi-supervised learning and model ensemble. We utilize CLIP to extract image features, and employ various pretrained language models under task-adaptive pretraining for text feature extraction. To enhance the detection and generalization capabilities of the model, we implement sample data augmentation using semi-supervised pseudo-labeling methods, introduce adversarial training strategies, and design a two-stage global model ensemble strategy. Our proposed method surpasses the provided baseline method, with Macro/Micro F1 values of 0.80910/0.83667 in the English leaderboard. Our submission ranks 3rd/19 in terms of Macro F1 and 1st/19 in terms of Micro F1.

pdf bib abs
ISDS-NLP at SemEval-2024 Task 10: Transformer based neural networks for emotion recognition in conversations
Claudiu Creanga | Liviu P. Dinu

This paper outlines the approach of the ISDS-NLP team in the SemEval 2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF). For Subtask 1 we obtained a weighted F1 score of 0.43 and placed 12 in the leaderboard. We investigate two distinct approaches: Masked Language Modeling (MLM) and Causal Language Modeling (CLM). For MLM, we employ pre-trained BERT-like models in a multilingual setting, fine-tuning them with a classifier to predict emotions. Experiments with varying input lengths, classifier architectures, and fine-tuning strategies demonstrate the effectiveness of this approach. Additionally, we utilize Mistral 7B Instruct V0.2, a state-of-the-art model, applying zero-shot and few-shot prompting techniques. Our findings indicate that while Mistral shows promise, MLMs currently outperform them in sentence-level emotion classification.

pdf bib abs
UMUTeam at SemEval-2024 Task 4: Multimodal Identification of Persuasive Techniques in Memes through Large Language Models
Ronghao Pan | José Antonio García-díaz | Rafael Valencia-garcía

In this manuscript we describe the UMUTeam’s participation in SemEval-2024 Task 4, a shared task to identify different persuasion techniques in memes. The task is divided into three subtasks. One is a multimodal subtask of identifying whether a meme contains persuasion or not. The others are hierarchical multi-label classifications that consider textual content alone or a multimodal setting of text and visual content. This is a multilingual task, and we participated in all three subtasks but we focus only on the English dataset. Our approach is based on a fine-tuning approach with the pre-trained RoBERTa-large model. In addition, for multimodal cases with both textual and visual content, we used the LMM called LlaVa to extract image descriptions and combine them with the meme text. Our system performed well in three subtasks, achieving the tenth best result with an Hierarchical F1 of 64.774%, the fourth best in Subtask 2a with an Hierarchical F1 of 69.003%, and the eighth best in Subtask 2b with a Macro F1 of 78.660%.

This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations. We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction (MER-MCE) framework that integrates text, audio, and visual modalities using specialized emotion encoders. Our approach sets itself apart from top-performing teams by leveraging modality-specific features for enhanced emotion understanding and causality inference. Experimental evaluation demonstrates the advantages of our multimodal approach, with our submission achieving a competitive weighted F1 score of 0.3435, ranking third with a margin of only 0.0339 behind the 1st team and 0.0025 behind the 2nd team.

pdf bib abs
UMUTeam at SemEval-2024 Task 6: Leveraging Zero-Shot Learning for Detecting Hallucinations and Related Observable Overgeneration Mistakes
Ronghao Pan | José Antonio García-díaz | Tomás Bernal-beltrán | Rafael Valencia-garcía

In these working notes we describe the UMUTeam’s participation in SemEval-2024 shared task 6, which aims at detecting grammatically correct output of Natural Language Generation with incorrect semantic information in two different setups: model-aware and model-agnostic tracks. The task is consists of three subtasks with different model setups. Our approach is based on exploiting the zero-shot classification capability of the Large Language Models LLaMa-2, Tulu and Mistral, through prompt engineering. Our system ranked eighteenth in the model-aware setup with an accuracy of 78.4% and 29th in the model-agnostic setup with an accuracy of 76.9333%.

pdf bib abs
DFKI-NLP at SemEval-2024 Task 2: Towards Robust LLMs Using Data Perturbations and MinMax Training
Bhuvanesh Verma | Lisa Raithel

The NLI4CT task at SemEval-2024 emphasizes the development of robust models for Natural Language Inference on Clinical Trial Reports (CTRs) using large language models (LLMs). This edition introduces interventions specifically targeting the numerical, vocabulary, and semantic aspects of CTRs. Our proposed system harnesses the capabilities of the state-of-the-art Mistral model (Jiang et al., 2023), complemented by an auxiliary model, to focus on the intricate input space of the NLI4CT dataset. Through the incorporation of numerical and acronym-based perturbations to the data, we train a robust system capable of handling both semantic-altering and numerical contradiction interventions. Our analysis on the dataset sheds light on the challenging sections of the CTRs for reasoning.

pdf bib abs
UMUTeam at SemEval-2024 Task 8: Combining Transformers and Syntax Features for Machine-Generated Text Detection
Ronghao Pan | José Antonio García-díaz | Pedro José Vivancos-vicente | Rafael Valencia-garcía

These working notes describe the UMUTeam’s participation in Task 8 of SemEval-2024 entitled “Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection”. This shared task aims at identifying machine-generated text in order to mitigate its potential misuse. This shared task is divided into three subtasks: Subtask A, a binary classification task to determine whether a given full-text was written by a human or generated by a machine; Subtask B, a multi-class classification problem to determine, given a full-text, who generated it. It can be written by a human or generated by a specific language model; and Subtask C, mixed human-machine text recognition. We participated in Subtask B, using an approach based on fine-tuning a pre-trained model, such as RoBERTa, combined with syntactic features of the texts. Our system placed 23rd out of a total of 77 participants, with a score of 75.350%, outperforming the baseline.

pdf bib abs
UMUTeam at SemEval-2024 Task 10: Discovering and Reasoning about Emotions in Conversation using Transformers
Ronghao Pan | José Antonio García-díaz | Diego Roldán | Rafael Valencia-garcía

These notes describe the participation of the UMUTeam in EDiReF, the 10th shared task of SemEval 2024. The goal is to develop systems for detecting and inferring emotional changes in the conversation. The task was divided into three related subtasks: (i) Emotion Recognition in Conversation (ERC) in Hindi-English code-mixed conversations, (ii) Emotion Flip Reasoning (EFR) in Hindi-English code-mixed conversations, and (iii) EFR in English conversations. We were involved in all three and our approach is based on a fine-tuning approach with different pre-trained models. After evaluation, we found BERT to be the best model for ERC and EFR and with this model we achieved the thirteenth best result with an F1 score of 43% in Subtask 1, the sixth best in Subtask 2 with an F1 score of 26% and the fifteenth best in Subtask 3 with an F1 score of 22%.

pdf bib abs
TM-TREK at SemEval-2024 Task 8: Towards LLM-Based Automatic Boundary Detection for Human-Machine Mixed Text
Xiaoyan Qu | Xiangfeng Meng

With the increasing prevalence of text gener- ated by large language models (LLMs), there is a growing concern about distinguishing be- tween LLM-generated and human-written texts in order to prevent the misuse of LLMs, such as the dissemination of misleading information and academic dishonesty. Previous research has primarily focused on classifying text as ei- ther entirely human-written or LLM-generated, neglecting the detection of mixed texts that con- tain both types of content. This paper explores LLMs’ ability to identify boundaries in human- written and machine-generated mixed texts. We approach this task by transforming it into a to- ken classification problem and regard the label turning point as the boundary. Notably, our ensemble model of LLMs achieved first place in the ‘Human-Machine Mixed Text Detection’ sub-task of the SemEval’24 Competition Task 8. Additionally, we investigate factors that in- fluence the capability of LLMs in detecting boundaries within mixed texts, including the incorporation of extra layers on top of LLMs, combination of segmentation loss, and the im- pact of pretraining. Our findings aim to provide valuable insights for future research in this area.

pdf bib abs
Team NP_PROBLEM at SemEval-2024 Task 7: Numerical Reasoning in Headline Generation with Preference Optimization
Pawan Rajpoot | Nut Chukamphaeng

While large language models (LLMs) exhibit impressive linguistic abilities, their numerical reasoning skills within real-world contexts re- main under-explored. This paper describes our participation in a headline-generation challenge by Numeval at Semeval 2024, which focused on numerical reasoning. Our system achieved an overall top numerical accuracy of 73.49% on the task. We explore the system’s design choices contributing to this result and analyze common error patterns. Our findings highlight the potential and ongoing challenges of integrat- ing numerical reasoning within large language model-based headline generation.

pdf bib abs
OPDAI at SemEval-2024 Task 6: Small LLMs can Accelerate Hallucination Detection with Weakly Supervised Data
Ze Chen | Chengcheng Wei | Songtan Fang | Jiarong He | Max Gao

This paper mainly describes a unified system for hallucination detection of LLMs, which wins the second prize in the model-agnostic track of the SemEval-2024 Task 6, and also achieves considerable results in the model-aware track. This task aims to detect hallucination with LLMs for three different text-generation tasks without labeled training data. We utilize prompt engineering and few-shot learning to verify the performance of different LLMs on the validation data. Then we select the LLMs with better performance to generate high-quality weakly supervised training data, which not only satisfies the consistency of different LLMs, but also satisfies the consistency of the optimal LLM with different sampling parameters. Furthermore, we finetune different LLMs by using the constructed training data, and finding that a relatively small LLM can achieve a competitive level of performance in hallucination detection, when compared to the large LLMs and the prompt-based approaches using GPT-4.

pdf bib abs
SSN_ARMM at SemEval-2024 Task 10: Emotion Detection in Multilingual Code-Mixed Conversations using LinearSVC and TF-IDF
Rohith Arumugam | Angel Deborah | Rajalakshmi Sivanaiah | Milton R S | Mirnalinee Thankanadar

Our paper explores a task involving the analysis of emotions and triggers within dialogues. We annotate each utterance with an emotion and identify triggers, focusing on binary labeling. We emphasize clear guidelines for replicability and conduct thorough analyses, including multiple system runs and experiments to highlight effective techniques. By simplifying the complexities and detailing clear methodologies, our study contributes to advancing emotion analysis and trigger identification within dialogue systems.

pdf bib abs
TüDuo at SemEval-2024 Task 2: Flan-T5 and Data Augmentation for Biomedical NLI
Veronika Smilga | Hazem Alabiad

This paper explores using data augmentation with smaller language models under 3 billion parameters for the SemEval-2024 Task 2 on Biomedical Natural Language Inference for Clinical Trials. We fine-tune models from the Flan-T5 family with and without using augmented data automatically generated by GPT-3.5-Turbo and find that data augmentation through techniques like synonym replacement, syntactic changes, adding random facts, and meaning reversion improves model faithfulness (ability to change predictions for semantically different inputs) and consistency (ability to give same predictions for semantic preserving changes). However, data augmentation tends to decrease performance on the original dataset distribution, as measured by F1 score. Our best system is the Flan-T5 XL model fine-tuned on the original training data combined with over 6,000 augmented examples. The system ranks in the top 10 for all three metrics.

This paper reports on an innovative approach to Emotion Recognition in Conversation and Emotion Flip Reasoning for the SemEval-2024 competition with a specific focus on analyzing Hindi-English code-mixed language. By integrating Large Language Models (LLMs) with Instruction-based Fine-tuning and Quantized Low-Rank Adaptation (QLoRA), this study introduces innovative techniques like Sentext-height and advanced prompting strategies to navigate the intricacies of emotional analysis in code-mixed conversational data. The results of the proposed work effectively demonstrate its ability to overcome label bias and the complexities of code-mixed languages. Our team achieved ranks of 5, 3, and 3 in tasks 1, 2, and 3 respectively. This study contributes valuable insights and methods for enhancing emotion recognition models, underscoring the importance of continuous research in this field.

pdf bib abs
YNU-HPCC at SemEval-2024 Task 5: Regularized Legal-BERT for Legal Argument Reasoning Task in Civil Procedure
Peng Shi | Jin Wang | Xuejie Zhang

This paper describes the submission of team YNU-HPCC to SemEval-2024 for Task 5: The Legal Argument Reasoning Task in Civil Procedure. The task asks candidates the topic, questions, and answers, classifying whether a given candidate’s answer is correct (True) or incorrect (False). To make a sound judgment, we propose a system. This system is based on fine-tuning the Legal-BERT model that specializes in solving legal problems. Meanwhile,Regularized Dropout (R-Drop) and focal Loss were used in the model. R-Drop is used for data augmentation, and focal loss addresses data imbalances. Our system achieved relatively good results on the competition’s official leaderboard. The code of this paper is available at https://github.com/YNU-PengShi/SemEval-2024-Task5.

Emotion Recognition in Conversation (ERC) in the context of code-mixed Hindi-English interactions is a subtask addressed in SemEval-2024 as Task 10. We made our maiden attempt to solve the problem using natural language processing, machine learning and deep learning techniques, that perform well in properly assigning emotions to individual utterances from a predefined collection. The use of well-proven classifier such as Long Short Term Memory networks improve the model’s efficacy than the BERT and Glove based models. How-ever, difficulties develop in the subtle arena of emotion-flip reasoning in multi-party discussions, emphasizing the importance of specialized methodologies. Our findings shed light on the intricacies of emotion dynamics in code-mixed languages, pointing to potential areas for further research and refinement in multilingual understanding.

pdf bib abs
UIR-ISC at SemEval-2024 Task 3: Textual Emotion-Cause Pair Extraction in Conversations
Hongyu Guo | Xueyao Zhang | Yiyang Chen | Lin Deng | Binyang Li

The goal of Emotion Cause Pair Extraction (ECPE) is to explore the causes of emotion changes and what causes a certain emotion. This paper proposes a three-step learning approach for the task of Textual Emotion-Cause Pair Extraction in Conversations in SemEval-2024 Task 3, named ECSP. We firstly perform data preprocessing operations on the original dataset to construct negative samples. Secondly, we use a pre-trained model to construct token sequence representations with contextual information to obtain emotion prediction. Thirdly, we regard the textual emotion-cause pair extraction task as a machine reading comprehension task, and fine-tune two pre-trained models, RoBERTa and SpanBERT. Our results have achieved good results in the official rankings, ranking 3rd under the strict match with the Strict F1-score of 15.18%, which further shows that our system has a robust performance.

pdf bib abs
YNU-HPCC at SemEval-2024 Task10: Pre-trained Language Model for Emotion Discovery and Reasoning its Flip in Conversation
Chenyi Liang | Jin Wang | Xuejie Zhang

This paper describes the application of fine-tuning pre-trained models for SemEval-2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF), which requires the prediction of emotions for each utterance in a conversation and the identification of sentences where an emotional flip occurs. This model is built on the DeBERTa transformer model and enhanced for emotion detection and flip reasoning in conversations. It employs specific separators for utterance processing and utilizes specific padding to handle variable-length inputs. Methods such as R-drop, back translation, and focalloss are also employed in the training of my model. The model achieved specific results on the competition’s official leaderboard. The code of this paper is available athttps://github.com/jiaowoobjiuhao/SemEval-2024-task10.

pdf bib abs
YNU-HPCC at SemEval-2024 Task 2: Applying DeBERTa-v3-large to Safe Biomedical Natural Language Inference for Clinical Trials
Rengui Zhang | Jin Wang | Xuejie Zhang

This paper describes the system for the YNU-HPCC team for SemEval2024 Task 2, focusing on Safe Biomedical Natural Language Inference for Clinical Trials. The core challenge of this task lies in discerning the textual entailment relationship between Clinical Trial Reports (CTR) and statements annotated by expert annotators, including the necessity to infer the relationships in texts subjected to semantic interventions accurately. Our approach leverages a fine-tuned DeBERTa-v3-large model augmented with supervised contrastive learning and back-translation techniques. Supervised contrastive learning aims to bolster classification ac-curacy while back-translation enriches the diversity and quality of our training corpus. Our method achieves a decent F1 score. However, the results also indicate a need for further en-hancements in the system’s capacity for deep semantic comprehension, highlighting areas for future refinement. The code of this paper is available at:https://github.com/RGTnuw/RG_YNU-HPCC-at-Semeval2024-Task2.

pdf bib abs
YNU-HPCC at SemEval-2024 Task 1: Self-Instruction Learning with Black-box Optimization for Semantic Textual Relatedness
Weijie Li | Jin Wang | Xuejie Zhang

This paper introduces a system designed for SemEval-2024 Task 1 that focuses on assessing Semantic Textual Relatedness (STR) between sentence pairs, including its multilingual version. STR, which evaluates the coherence of sentences, is distinct from Semantic Textual Similarity (STS). However, Large Language Models (LLMs) such as ERNIE-Bot-turbo, typically trained on STS data, often struggle to differentiate between the two concepts. To address this, we developed a self-instruction method that enhances their performance distinguishing STR, particularly in cases with high STS but low STR. Beginning with a task description, the system generates new task instructions refined through human feedback. It then iteratively enhances these instructions by comparing them to the original and evaluating the differences. Utilizing the Large Language Models’ (LLMs) natural language comprehension abilities, the system aims to produce progressively optimized instructions based on the resulting scores. Through our optimized instructions, ERNIE-Bot-turbo exceeds the performance of conventional models,achieving a score enhancement of 4 to 7% on multilingual development datasets.

pdf bib abs
AAdaM at SemEval-2024 Task 1: Augmentation and Adaptation for Multilingual Semantic Textual Relatedness
Miaoran Zhang | Mingyang Wang | Jesujoba Alabi | Dietrich Klakow

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages. The shared task aims at measuring the semantic textual relatedness between pairs of sentences, with a focus on a range of under-represented languages. In this work, we propose using machine translation for data augmentation to address the low-resource challenge of limited training data. Moreover, we apply task-adaptive pre-training on unlabeled task data to bridge the gap between pre-training and task adaptation. For model training, we investigate both full fine-tuning and adapter-based tuning, and adopt the adapter framework for effective zero-shot cross-lingual transfer. We achieve competitive results in the shared task: our system performs the best among all ranked teams in both subtask A (supervised learning) and subtask C (cross-lingual transfer).

pdf bib abs
BITS Pilani at SemEval-2024 Task 10: Fine-tuning BERT and Llama 2 for Emotion Recognition in Conversation
Dilip Venkatesh | Pasunti Prasanjith | Yashvardhan Sharma

Emotion Recognition in Conversation (ERC)aims to assign an emotion to a dialogue in aconversation between people. The first subtaskof EDiReF shared task aims to assign an emo-tions to a Hindi-English code mixed conversa-tion. For this, our team proposes a system toidentify the emotion based on fine-tuning largelanguage models on the MaSaC dataset. Forour study we have fine tuned 2 LLMs BERTand Llama 2 to perform sequence classificationto identify the emotion of the text.

pdf bib abs
BITS Pilani at SemEval-2024 Task 9: Prompt Engineering with GPT-4 for Solving Brainteasers
Dilip Venkatesh | Yashvardhan Sharma

Solving brainteasers is a task that requires complex reasoning prowess. The increase of research in natural language processing has leadto the development of massive large languagemodels with billions (or trillions) of parameters that are able to solve difficult questionsdue to their advanced reasoning capabilities.The SemEval BRAINTEASER shared tasks consists of sentence and word puzzles along withoptions containing the answer for the puzzle.Our team uses OpenAI’s GPT-4 model alongwith prompt engineering to solve these brainteasers.

pdf bib abs
Bridging Numerical Reasoning and Headline Generation for Enhanced Language Models
Vaishnavi R | Srimathi T | Aarthi S | Harini V

Headline generation becomes a vital tool in the dynamic world of digital media, combining creativity and scientific rigor to engage readers while maintaining accuracy. However, accuracy is currently hampered by numerical integration problems, which affect both abstractive and extractive approaches. Sentences that are extracted from the original material are typically too short to accurately represent complex information. Our research introduces an innovative two-step training technique to tackle these problems, emphasizing the significance of enhanced numerical reasoning in headline development. Promising advances are presented by utilizing text-to-text processing capabilities of the T5 model and advanced NLP approaches like BERT and RoBERTa. With the help of external contributions and our dataset, our Flan-T5 model has been improved to demonstrate how these methods may be used to overcome numerical integration issues and improve the accuracy of headline production.

pdf bib abs
TueSents at SemEval-2024 Task 8: Predicting the Shift from Human Authorship to Machine-generated Output in a Mixed Text
Valentin Pickard | Hoa Do

This paper describes our approach and resultsfor the SemEval 2024 task of identifying thetoken index in a mixed text where a switchfrom human authorship to machine-generatedtext occurs. We explore two BiLSTMs, oneover sentence feature vectors to predict theindex of the sentence containing such a changeand another over character embeddings of thetext. As sentence features, we compute tokencount, mean token length, standard deviationof token length, counts for punctuation andspace characters, various readability scores,word frequency class and word part-of-speechclass counts for each sentence. class counts.The evaluation is performed on mean absoluteerror (MAE) between predicted and actualboundary token index. While our competitionresults were notably below the baseline, theremay still be useful aspects to our approach.

pdf bib abs
TECHSSN1 at SemEval-2024 Task 10: Emotion Classification in Hindi-English Code-Mixed Dialogue using Transformer-based Models
Venkatasai Ojus Yenumulapalli | Pooja Premnath | Parthiban Mohankumar | Rajalakshmi Sivanaiah | Angel Deborah

The increase in the popularity of code mixed languages has resulted in the need to engineer language models for the same . Unlike pure languages, code-mixed languages lack clear grammatical structures, leading to ambiguous sentence constructions. This ambiguity presents significant challenges for natural language processing tasks, including syntactic parsing, word sense disambiguation, and language identification. This paper focuses on emotion recognition of conversations in Hinglish, a mix of Hindi and English, as part of Task 10 of SemEval 2024. The proposed approach explores the usage of standard machine learning models like SVM, MNB and RF, and also BERT-based models for Hindi-English code-mixed data- namely, HingBERT, Hing mBERT and HingRoBERTa for subtask A.

pdf bib abs
SHROOM-INDElab at SemEval-2024 Task 6: Zero- and Few-Shot LLM-Based Classification for Hallucination Detection
Bradley Allen | Fina Polat | Paul Groth

We describe the University of Amsterdam Intelligent Data Engineering Lab team’s entry for the SemEval-2024 Task 6 competition. The SHROOM-INDElab system builds on previous work on using prompt programming and in-context learning with large language models (LLMs) to build classifiers for hallucination detection, and extends that work through the incorporation of context-specific definition of task, role, and target concept, and automated generation of examples for use in a few-shot prompting approach. The resulting system achieved fourth-best and sixth-best performance in the model-agnostic track and model-aware tracks for Task 6, respectively, and evaluation using the validation sets showed that the system’s classification decisions were consistent with those of the crowdsourced human labelers. We further found that a zero-shot approach provided better accuracy than a few-shot approach using automatically generated examples. Code for the system described in this paper is available on Github.

pdf bib abs
I2C-Huelva at SemEval-2024 Task 8: Boosting AI-Generated Text Detection with Multimodal Models and Optimized Ensembles
Alberto Rodero Peña | Jacinto Mata Vazquez | Victoria Pachón Álvarez

With the rise of AI-based text generators, the need for effective detection mechanisms has become paramount. This paper presents new techniques for building adaptable models and optimizing training aspects for identifying synthetically produced texts across multiple generators and domains. The study, divided into binary and multilabel classification tasks, avoids overfitting through strategic training data limitation. A key innovation is the incorporation of multimodal models that blend numerical text features with conventional NLP approaches. The work also delves into optimizing ensemble model combinations via various voting methods, focusing on accuracy as the official metric. The optimized ensemble strategy demonstrates significant efficacy in both subtasks, highlighting the potential of multimodal and ensemble methods in enhancing the robustness of detection systems against emerging text generators.

This paper introduces an approach developed for multimodal meme analysis, specifically targeting the identification of persuasion techniques embedded within memes. Our methodology integrates Large Language Models (LLMs) and contrastive learning image encoders to discern the presence of persuasive elements in memes across diverse platforms. By capitalizing on the contextual understanding facilitated by LLMs and the discriminative power of contrastive learning for image encoding, our framework provides a robust solution for detecting and classifying memes with persuasion techniques. The system was used in Task 4 of Semeval 2024, precisely for Substask 2b (binary classification of presence of persuasion techniques). It showed promising results overall, achieving a Macro-F1=0.7986 on the English test data (i.e., the language the system was trained on) and Macro-F1=0.66777/0.47917/0.5554, respectively, on the other three “surprise” languages proposed by the task organizers, i.e., Bulgarian, North Macedonian and Arabic. The paper provides an overview of the system, along with a discussion of the results obtained and its main limitations.

pdf bib abs
Fired_from_NLP at SemEval-2024 Task 1: Towards Developing Semantic Textual Relatedness Predictor - A Transformer-based Approach
Anik Shanto | Md. Sajid Alam Chowdhury | Mostak Chowdhury | Udoy Das | Hasan Murad

Predicting semantic textual relatedness (STR) is one of the most challenging tasks in the field of natural language processing. Semantic relatedness prediction has real-life practical applications while developing search engines and modern text generation systems. A shared task on semantic textual relatedness has been organized by SemEval 2024, where the organizer has proposed a dataset on semantic textual relatedness in the English language under Shared Task 1 (Track A3). In this work, we have developed models to predict semantic textual relatedness between pairs of English sentences by training and evaluating various transformer-based model architectures, deep learning, and machine learning methods using the shared dataset. Moreover, we have utilized existing semantic textual relatedness datasets such as the stsb multilingual benchmark dataset, the SemEval 2014 Task 1 dataset, and the SemEval 2015 Task 2 dataset. Our findings show that in the SemEval 2024 Shared Task 1 (Track A3), the fine-tuned-STS-BERT model performed the best, scoring 0.8103 on the test set and placing 25th out of all participants.

pdf bib abs
BITS Pilani at SemEval-2024 Task 1: Using text-embedding-3-large and LaBSE embeddings for Semantic Textual Relatedness
Dilip Venkatesh | Sundaresan Raman

Semantic Relatedness of a pair of text (sentences or words) is the degree to which theirmeanings are close. The Track A of the Semantic Textual Relatedness shared task aimsto find the semantic relatedness for the English language along with multiple other lowresource languages with the use of pretrainedlanguage models. We proposes a system tofind the Spearman coefficient of a textual pairusing pretrained embedding models like textembedding-3-large and LaBSE.

In this paper, we present our novel systems developed for the SemEval-2024 hallucination detection task. Our investigation spans a range of strategies to compare model predictions with reference standards, encompassing diverse baselines, the refinement of pre-trained encoders through supervised learning, and an ensemble approaches utilizing several high-performing models. Through these explorations, we introduce three distinct methods that exhibit strong performance metrics. To amplify our training data, we generate additional training samples from unlabelled training subset. Furthermore, we provide a detailed comparative analysis of our approaches. Notably, our premier method achieved a commendable 9th place in the competition’s model-agnostic track and 20th place in model-aware track, highlighting its effectiveness and potential.

pdf bib abs
USTCCTSU at SemEval-2024 Task 1: Reducing Anisotropy for Cross-lingual Semantic Textual Relatedness Task
Jianjian Li | Shengwei Liang | Yong Liao | Hongping Deng | Haiyang Yu

Cross-lingual semantic textual relatedness task is an important research task that addresses challenges in cross-lingual communication and text understanding. It helps establish semantic connections between different languages, crucial for downstream tasks like machine translation, multilingual information retrieval, and cross-lingual text understanding.Based on extensive comparative experiments, we choose the XLM-R-base as our base model and use pre-trained sentence representations based on whitening to reduce anisotropy.Additionally, for the given training data, we design a delicate data filtering method to alleviate the curse of multilingualism. With our approach, we achieve a 2nd score in Spanish, a 3rd in Indonesian, and multiple entries in the top ten results in the competition’s track C. We further do a comprehensive analysis to inspire future research aimed at improving performance on cross-lingual tasks.

pdf bib abs
GreyBox at SemEval-2024 Task 4: Progressive Fine-tuning (for Multilingual Detection of Propaganda Techniques)
Nathan Roll | Calbert Graham

We introduce a novel fine-tuning approach that effectively primes transformer-based language models to detect rhetorical and psychological techniques within internet memes. Our end-to-end system retains multilingual and task-general capacities from pretraining stages while adapting to domain intricacies using an increasingly targeted set of examples– achieving competitive rankings across English, Bulgarian, and North Macedonian. We find that our monolingual post-training regimen is sufficient to improve task performance in 17 language varieties beyond equivalent zero-shot capabilities despite English-only data. To promote further research, we release our code publicly on GitHub.

pdf bib abs
NLU-STR at SemEval-2024 Task 1: Generative-based Augmentation and Encoder-based Scoring for Semantic Textual Relatedness
Sanad Malaysha | Mustafa Jarrar | Mohammed Khalilia

Semantic textual relatedness is a broader concept of semantic similarity. It measures the extent to which two chunks of text convey similar meaning or topics, or share related concepts or contexts. This notion of relatedness can be applied in various applications, such as document clustering and summarizing. SemRel-2024, a shared task in SemEval-2024, aims at reducing the gap in the semantic relatedness task by providing datasets for fourteen languages and dialects including Arabic. This paper reports on our participation in Track A (Algerian and Moroccan dialects) and Track B (Modern Standard Arabic). A BERT-based model is augmented and fine-tuned for regression scoring in supervised track (A), while BERT-based cosine similarity is employed for unsupervised track (B). Our system ranked 1st in SemRel-2024 for MSA with a Spearman correlation score of 0.49. We ranked 5th for Moroccan and 12th for Algerian with scores of 0.83 and 0.53, respectively.

pdf bib abs
scaLAR SemEval-2024 Task 1: Semantic Textual Relatednes for English
Anand Kumar | Hemanth Kumar

This study investigates Semantic TextualRelated- ness (STR) within Natural LanguageProcessing (NLP) through experiments conducted on a dataset from the SemEval-2024STR task. The dataset comprises train instances with three features (PairID, Text, andScore) and test instances with two features(PairID and Text), where sentence pairs areseparated by '/n’ in the Text column. UsingBERT(sentence transformers pipeline), we explore two approaches: one with fine-tuning(Track A: Supervised) and another without finetuning (Track B: UnSupervised). Fine-tuningthe BERT pipeline yielded a Spearman correlation coefficient of 0.803, while without finetuning, a coefficient of 0.693 was attained usingcosine similarity. The study concludes by emphasizing the significance of STR in NLP tasks,highlighting the role of pre-trained languagemodels like BERT and Sentence Transformersin enhancing semantic relatedness assessments.

pdf bib abs
TECHSSN at SemEval-2024 Task 1: Multilingual Analysis for Semantic Textual Relatedness using Boosted Transformer Models
Shreejith Babu G | Ravindran V | Aashika Jetti | Rajalakshmi Sivanaiah | Angel Deborah

This paper presents our approach to SemEval- 2024 Task 1: Semantic Textual Relatedness (STR). Out of the 14 languages provided, we specifically focused on English and Telugu. Our proposal employs advanced natural language processing techniques and leverages the Sentence Transformers library for sentence embeddings. For English, a Gradient Boosting Regressor trained on DistilBERT embeddingsachieves competitive results, while for Telugu, a multilingual model coupled with hyperparameter tuning yields enhanced performance. The paper discusses the significance of semantic relatedness in various languages, highlighting the challenges and nuances encountered. Our findings contribute to the understanding of semantic textual relatedness across diverse linguistic landscapes, providing valuable insights for future research in multilingual natural language processing.

pdf bib abs
Noot Noot at SemEval-2024 Task 7: Numerical Reasoning and Headline Generation
Sankalp Bahad | Yash Bhaskar | Parameswari Krishnamurthy

Natural language processing (NLP) modelshave achieved remarkable progress in recentyears, particularly in tasks related to semanticanalysis. However, many existing benchmarksprimarily focus on lexical and syntactic un-derstanding, often overlooking the importanceof numerical reasoning abilities. In this pa-per, we argue for the necessity of incorporatingnumeral-awareness into NLP evaluations andpropose two distinct tasks to assess this capabil-ity: Numerical Reasoning and Headline Gener-ation. We present datasets curated for each taskand evaluate various approaches using both au-tomatic and human evaluation metrics. Ourresults demonstrate the diverse strategies em-ployed by participating teams and highlight thepromising performance of emerging modelslike Mixtral 8x7b instruct. We discuss the im-plications of our findings and suggest avenuesfor future research in advancing numeral-awarelanguage understanding and generation.

pdf bib abs
Fine-tuning Language Models for AI vs Human Generated Text detection
Sankalp Bahad | Yash Bhaskar | Parameswari Krishnamurthy

In this paper, we introduce a machine-generated text detection system designed totackle the challenges posed by the prolifera-tion of large language models (LLMs). Withthe rise of LLMs such as ChatGPT and GPT-4,there is a growing concern regarding the po-tential misuse of machine-generated content,including misinformation dissemination. Oursystem addresses this issue by automating theidentification of machine-generated text acrossmultiple subtasks: binary human-written vs.machine-generated text classification, multi-way machine-generated text classification, andhuman-machine mixed text detection. We em-ploy the RoBERTa Base model and fine-tuneit on a diverse dataset encompassing variousdomains, languages, and sources. Throughrigorous evaluation, we demonstrate the effec-tiveness of our system in accurately detectingmachine-generated text, contributing to effortsaimed at mitigating its potential misuse.

pdf bib abs
eagerlearners at SemEval2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure
Hoorieh Sabzevari | Mohammadmostafa Rostamkhani | Sauleh Eetemadi

This study investigates the performance of the zero-shot method in classifying data using three large language models, alongside two models with large input token sizes and the two pre-trained models on legal data. Our main dataset comes from the domain of U.S. civil procedure. It includes summaries of legal cases, specific questions, potential answers, and detailed explanations for why each solution is relevant, all sourced from a book aimed at law students. By comparing different methods, we aimed to understand how effectively they handle the complexities found in legal datasets. Our findings show how well the zero-shot method of large language models can understand complicated data. We achieved our highest F1 score of 64% in these experiments.

The Large Language Models (LLMs) exhibit remarkable ability to generate fluent content across a wide spectrum of user queries. However, this capability has raised concerns regarding misinformation and personal information leakage. In this paper, we present our methods for the SemEval2024 Task8, aiming to detect machine-generated text across various domains in both mono-lingual and multi-lingual contexts. Our study comprehensively analyzes various methods to detect machine-generated text, including statistical, neural, and pre-trained model approaches. We also detail our experimental setup and perform a in-depth error analysis to evaluate the effectiveness of these methods. Our methods obtain an accuracy of 86.9% on the test set of subtask-A mono and 83.7% for subtask-B. Furthermore, we also highlight the challenges and essential factors for consideration in future studies.

pdf bib abs
Pinealai at SemEval-2024 Task 1: Exploring Semantic Relatedness Prediction using Syntactic, TF-IDF, and Distance-Based Features.
Alex Eponon | Luis Ramos Perez

The central aim of this experiment is to establish a system proficient in predicting semantic relatedness between pairs of English texts. Additionally, the study seeks to delve into diverse features capable of enhancing the ability of models to identify semantic relatedness within given sentences. Several strategies have been used that combine TF-IDF, syntactic features, and similarity measures to train machine learning to predict semantic relatedness between pairs of sentences. The results obtained were above the baseline with an approximate Spearman score of 0.84.

We propose a training algorithm based on retrieval-augmented generation (RAG) to obtain the most similar training samples. The training samples obtained are used as a reference to perform contextual learning-based fine-tuning of large language models (LLMs). We use the proposed method to generate headlines and extract numerical values from unstructured text. Models are made aware of the presence of numbers in the unstructured text with extended markup language (XML) tags specifically designed to capture the numbers. The headlines of unstructured text are preprocessed to wrap the number and then presented to the model. A number of mathematical operations are also passed as references to cover the chain-of-thought (COT) approach. Therefore, the model can calculate the final value passed to a mathematical operation. We perform the validation of numbers as a post-processing step to verify whether the numerical value calculated by the model is correct or not. The automatic validation of numbers in the generated headline helped the model achieve the best results in human evaluation among the methods involved.

pdf bib abs
AlphaIntellect at SemEval-2024 Task 6: Detection of Hallucinations in Generated Text
Sohan Choudhury | Priyam Saha | Subharthi Ray | Shankha Das | Dipankar Das

One major issue in natural language generation (NLG) models is detecting hallucinations (semantically inaccurate outputs). This study investigates a hallucination detection system designed for three distinct NLG tasks: definition modeling, paraphrase generation, and machine translation. The system uses feedforward neural networks for classification and SentenceTransformer models for similarity scores and sentence embeddings. Even though the SemEval-2024 benchmark shows good results, there is still room for improvement. Promising paths toward improving performance include considering multi-task learning methods, including strategies for handling out-of-domain data minimizing bias, and investigating sophisticated architectures.

pdf bib abs
YSP at SemEval-2024 Task 1: Enhancing Sentence Relatedness Assessment using Siamese Networks
Yasamin Aali | Sardar Hamidian | Parsa Farinneya

In this paper we present the system for Track A in the SemEval-2024 Task 1: Semantic Textual Relatedness for African and Asian Languages (STR). The proposed system integrates a Siamese Network architecture with pre-trained language models, including BERT, RoBERTa, and the Universal Sentence Encoder (USE). Through rigorous experimentation and analysis, we evaluate the performance of these models across multiple languages. Our findings reveal that the Universal Sentence Encoder excels in capturing semantic similarities, outperforming BERT and RoBERTa in most scenarios. Particularly notable is the USE’s exceptional performance in English and Marathi. These results emphasize the importance of selecting appropriate pre-trained models based on linguistic considerations and task requirements.

pdf bib abs
NootNoot At SemEval-2024 Task 6: Hallucinations and Related Observable Overgeneration Mistakes Detection
Sankalp Bahad | Yash Bhaskar | Parameswari Krishnamurthy

Semantic hallucinations in neural language gen-eration systems pose a significant challenge tothe reliability and accuracy of natural languageprocessing applications. Current neural mod-els often produce fluent but incorrect outputs,undermining the usefulness of generated text.In this study, we address the task of detectingsemantic hallucinations through the SHROOM(Semantic Hallucinations Real Or Mistakes)dataset, encompassing data from diverse NLGtasks such as definition modeling, machinetranslation, and paraphrase generation. We in-vestigate three methodologies: fine-tuning onlabelled training data, fine-tuning on labelledvalidation data, and a zero-shot approach usingthe Mixtral 8x7b instruct model. Our resultsdemonstrate the effectiveness of these method-ologies in identifying semantic hallucinations,with the zero-shot approach showing compet-itive performance without additional training.Our findings highlight the importance of robustdetection mechanisms for ensuring the accu-racy and reliability of neural language genera-tion systems.

pdf bib abs
Transformers at SemEval-2024 Task 5: Legal Argument Reasoning Task in Civil Procedure using RoBERTa
Kriti Singhal | Jatin Bedi

Legal argument reasoning task in civil procedure is a new NLP task utilizing a dataset from the domain of the U.S. civil procedure. The task aims at identifying whether the solution to a question in the legal domain is correct or not. This paper describes the team “Transformers” submission to the Legal Argument Reasoning Task in Civil Procedure shared task at SemEval-2024 Task 5. We use a BERT-based architecture for the shared task. The highest F1-score score and accuracy achieved was 0.6172 and 0.6531 respectively. We secured the 13th rank in the Legal Argument Reasoning Task in Civil Procedure shared task.

pdf bib abs
YNU-HPCC at SemEval-2024 Task 7: Instruction Fine-tuning Models for Numerical Understanding and Generation
Kaiyuan Chen | Jin Wang | Xuejie Zhang

This paper presents our systems for Task 7, Numeral-Aware Language Understanding and Generation of SemEval 2024. As participants of Task 7, we engage in all subtasks and implement corresponding systems for each subtask. All subtasks cover three aspects: Quantitative understanding (English), Reading Comprehension of the Numbers in the text (Chinese), and Numeral-Aware Headline Generation (English). Our approach explores employing instruction-tuned models (Flan-T5) or text-to-text models (T5) to accomplish the respective subtasks. We implement the instruction fine-tuning with or without demonstrations and employ similarity-based retrieval or manual methods to construct demonstrations for each example in instruction fine-tuning. Moreover, we reformulate the model’s output into a chain-of-thought format with calculation expressions to enhance its reasoning performance for reasoning subtasks. The competitive results in all subtasks demonstrate the effectiveness of our systems.

The explosive growth of online content demands robust Natural Language Processing (NLP) techniques that can capture nuanced meanings and cultural context across diverse languages. Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective. Despite its pivotal role, prior NLP research has predominantly focused on English, limiting its applicability across languages. Addressing this gap, our paper dives into capturing deeper connections between sentences beyond simple word overlap. Going beyond English-centric NLP research, we explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more. Leveraging the SemEval-2024 shared task, we explore various language models across three learning paradigms: supervised, unsupervised, and cross-lingual. Our comprehensive methodology gains promising results, demonstrating the effectiveness of our approach. This work aims to not only showcase our achievements but also inspire further research in multilingual STR, particularly for low-resourced languages.

pdf bib abs
SEME at SemEval-2024 Task 2: Comparing Masked and Generative Language Models on Natural Language Inference for Clinical Trials
Mathilde Aguiar | Pierre Zweigenbaum | Nona Naderi

This paper describes our submission to Task 2 of SemEval-2024: Safe Biomedical Natural Language Inference for Clinical Trials. The Multi-evidence Natural Language Inference for Clinical Trial Data (NLI4CT) consists of a Textual Entailment (TE) task focused on the evaluation of the consistency and faithfulness of Natural Language Inference (NLI) models applied to Clinical Trial Reports (CTR). We test 2 distinct approaches, one based on finetuning and ensembling Masked Language Models and the other based on prompting Large Language Models using templates, in particular, using Chain-Of-Thought and Contrastive Chain-Of-Thought. Prompting Flan-T5-large in a 2-shot setting leads to our best system that achieves 0.57 F1 score, 0.64 Faithfulness, and 0.56 Consistency.

Large language models (LLMs) have recently obtained strong performance on complex reasoning tasks. However, their capabilities in specialized domains like law remain relatively unexplored. We present CLUEDO, a system to tackle a novel legal reasoning task that involves determining if a provided answer correctly addresses a legal question derived from U.S. civil procedure cases. CLUEDO utilizes multiple collaborator models that are trained using multiple-choice prompting to choose the right label and generate explanations. These collaborators are overseen by a final “detective” model that identifies the most accurate answer in a zero-shot manner. Our approach achieves an F1 macro score of 0.74 on the development set and 0.76 on the test set, outperforming individual models. Unlike the powerful GPT-4, CLUEDO provides more stable predictions thanks to the ensemble approach. Our results showcase the promise of tailored frameworks to enhance legal reasoning capabilities in LLMs.

pdf bib abs
Groningen Group E at SemEval-2024 Task 8: Detecting machine-generated texts through pre-trained language models augmented with explicit linguistic-stylistic features
Patrick Darwinkel | Sijbren Van Vaals | Marieke Van Der Holt | Jarno Van Houten

Our approach to detecting machine-generated text for the SemEval-2024 Task 8 combines a wide range of linguistic-stylistic features with pre-trained language models (PLM). Experiments using random forests and PLMs resulted in an augmented DistilBERT system for subtask A and B and an augmented Longformer for subtask C. These systems achieved accuracies of 0.63 and 0.77 for the mono- and multilingual tracks of subtask A, 0.64 for subtask B and a MAE of 26.07 for subtask C. Although lower than the task organizer’s baselines, we demonstrate that linguistic-stylistic features are predictors for whether a text was authored by a model (and if so, which one).

pdf bib abs
Magnum JUCSE at SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes
Adnan Khurshid | Dipankar Das

This paper focuses on the task of detecting persuasion techniques organised in a hierarchy within meme text in multiple languages like English, North Macedonian, Arabic and Bulgarian, exploring the ways in which textual elements contribute to the dissemination of persuasive messages.The main strategy of the system is to train a binary classifier for each node in the hierarchy and predict labels in a top down fashion by seeing the confidence value of the prediction at any node. For each unique label in the hierarchy, a dataset is created from the original dataset which is then used to train the binary classifier for that label

pdf bib abs
Tübingen-CL at SemEval-2024 Task 1: Ensemble Learning for Semantic Relatedness Estimation
Leixin Zhang | Çağrı Çöltekin

The paper introduces our system for SemEval-2024 Task 1, which aims to predict the relatedness of sentence pairs. Operating under the hypothesis that semantic relatedness is a broader concept that extends beyond mere similarity of sentences, our approach seeks to identify useful features for relatedness estimation. We employ an ensemble approach integrating various systems, including statistical textual features and outputs of deep learning models to predict relatedness scores. The findings suggest that semantic relatedness can be inferred from various sources and ensemble models outperform many individual systems in estimating semantic relatedness.

pdf bib abs
SemEval Task 8: A Comparison of Traditional and Neural Models for Detecting Machine Authored Text
Srikar Kashyap Pulipaka | Shrirang Mhalgi | Joseph Larson | Sandra Kübler

Since Large Language Models have reached a stage where it is becoming more and more difficult to distinguish between human and machine written text, there is an increasing need for automated systems to distinguish between them. As part of SemEval Task 8, Subtask A: Binary Human-Written vs. Machine-Generated Text Classification, we explore a variety of machine learning classifiers, from traditional statistical methods, such as Naïve Bayes and Decision Trees, to fine-tuned transformer models, suchas RoBERTa and ALBERT. Our findings show that using a fine-tuned RoBERTa model with optimizedhyperparameters yields the best accuracy. However, the improvement does not translate to the test set because of the differences in distribution in the development and test sets.

pdf bib abs
RACAI at SemEval-2024 Task 10: Combining algorithms for code-mixed Emotion Recognition in Conversation
Sara Niță | Vasile Păiș

Code-mixed emotion recognition constitutes a challenge for NLP research due to the text’s deviation from the traditional grammatical structure of the original languages. This paper describes the system submitted by the RACAI Team for the SemEval 2024 Task 10 - EDiReF subtasks 1: Emotion Recognition in Conversation (ERC) in Hindi-English code-mixed conversations. We propose a system that combines a transformer-based model with two simple neural networks.

pdf bib abs
ROSHA at SemEval-2024 Task 9: BRAINTEASER A Novel Task Defying Common Sense
Mohammadmostafa Rostamkhani | Shayan Mousavinia | Sauleh Eetemadi

In our exploration of SemEval 2024 Task 9, specifically the challenging BRAINTEASER: A Novel Task Defying Common Sense, we employed various strategies for the BRAINTEASER QA task, which encompasses both sentence and word puzzles. In the initial approach, we applied the XLM-RoBERTa model both to the original training dataset and concurrently to the original dataset alongside the BiRdQA dataset and the original dataset alongside RiddleSense for comprehensive model training.Another strategy involved expanding each word within our BiRdQA dataset into a full sentence. This unique perspective aimed to enhance the semantic impact of individual words in our training regimen for word puzzle (WP) riddles. Utilizing ChatGPT-3.5, we extended each word into an extensive sentence, applying this process to all options within each riddle.Furthermore, we explored the implementation of RECONCILE (Round-table conference) using three prominent large language models—ChatGPT, Gemini, and the Mixtral-8x7B Large Language Model (LLM). As a final approach, we leveraged GPT-4 results. Remarkably, our most successful experiment yielded noteworthy results, achieving a score of 0.900 for sentence puzzles (S_ori) and 0.906 for word puzzles (W_ori).

This paper explores semantic textual relatedness (STR) using fine-tuning techniques on the RoBERTa transformer model, focusing on sentence-level STR within Track A (Supervised). The study evaluates the effectiveness of this approach across different languages, with promising results in English and Spanish but encountering challenges in Arabic.

pdf bib abs
DUTh at SemEval 2024 Task 5: A multi-task learning approach for the Legal Argument Reasoning Task in Civil Procedure
Ioannis Maslaris | Avi Arampatzis

Text-generative models have proven to be good reasoners. Although reasoning abilities are mostly observed in larger language models, a number of strategies try to transfer this skill to smaller language models. This paper presents our approach to SemEval 2024 Task-5: The Legal Argument Reasoning Task in Civil Procedure. This shared task aims to develop a system that efficiently handles a multiple-choice question-answering task in the context of the US civil procedure domain. The dataset provides a human-generated rationale for each answer. Given the complexity of legal issues, this task certainly challenges the reasoning abilities of LLMs and AI systems in general. Our work explores fine-tuning an LLM as a correct/incorrect answer classifier. In this context, we are making use of multi-task learning toincorporate the rationales into the fine-tuning process.

pdf bib abs
MAMET at SemEval-2024 Task 7: Supervised Enhanced Reasoning Agent Model
Mahmood Kalantari | Mehdi Feghhi | Taha Khany Alamooti

In the intersection of language understanding and numerical reasoning, a formidable challenge arises in natural language processing (NLP). Our study delves into the realm of NumEval, focusing on numeral-aware language understanding and generation using the QP, QQA and QNLI datasets. We harness the potential of the Orca2 model, Fine-tuning it in both normal and Chain-of-Thought modes with prompt tuning to enhance accuracy. Despite initial conjectures, our findings reveal intriguing disparities in model performance. While standard training methodologies yield commendable accuracy rates. The core contribution of this work lies in its elucidation of the intricate interplay between dataset sequencing and model performance. We expected to achieve a general model with the Fine Tuning model on the QP and QNLI datasets respectively, which has good accuracy in all three datasets. However, this goal was not achieved, and in order to achieve this goal, we introduce our structure.

pdf bib abs
DUTh at SemEval-2024 Task 6: Comparing Pre-trained Models on Sentence Similarity Evaluation for Detecting of Hallucinations and Related Observable Overgeneration Mistakes
Ioanna Iordanidou | Ioannis Maslaris | Avi Arampatzis

In this paper, we present our approach toSemEval-2024 Task 6: SHROOM, a Sharedtask on Hallucinations and Related ObservableOvergeneration Mistakes, which aims to determine weather AI generated text is semanticallycorrect or incorrect. This work is a comparative study of Large Language Models (LLMs)in the context of the task, shedding light ontheir effectiveness and nuances. We present asystem that leverages pre-trained LLMs, suchas LaBSE, T5, and DistilUSE, for binary classification of given sentences into ‘Hallucination’or ‘Not Hallucination’ classes by evaluatingthe model’s output against the reference correct text. Moreover, beyond utilizing labeleddatasets, our methodology integrates syntheticlabel creation in unlabeled datasets, followedby the prediction of test labels.

pdf bib abs
MBZUAI-UNAM at SemEval-2024 Task 1: Sentence-CROBI, a Simple Cross-Bi-Encoder-Based Neural Network Architecture for Semantic Textual Relatedness
Jesus German Ortiz Barajas | Gemma Bel-enguix | Helena Goméz-adorno

The Semantic Textual Relatedness (STR) shared task aims at detecting the degree of semantic relatedness between pairs of sentences on low-resource languages from Afroasiatic, Indoeuropean, Austronesian, Dravidian, and Nigercongo families. We use the Sentence-CROBI architecture to tackle this problem. The model is adapted from its original purpose of paraphrase detection to explore its capacities in a related task with limited resources and in multilingual and monolingual settings. Our approach combines the vector representation of cross-encoders and bi-encoders and possesses high adaptable capacity by combining several pre-trained models. Our system obtained good results on the low-resource languages of the dataset using a multilingual fine-tuning approach.

pdf bib abs
DUTh at SemEval 2024 Task 8: Comparing classic Machine Learning Algorithms and LLM based methods for Multigenerator, Multidomain and Multilingual Machine-Generated Text Detection
Theodora Kyriakou | Ioannis Maslaris | Avi Arampatzis

Text-generative models evolve rapidly nowadays. Although, they are very useful tools for a lot of people, they have also raised concerns for different reasons. This paper presents our work for SemEval2024 Task-8 on 2 out of the 3 subtasks. This shared task aims at finding automatic models for making AI vs. human written text classification easier. Our team, after trying different preprocessing, several Machine Learning algorithms, and some LLMs, ended up with mBERT, XLM-RoBERTa, and BERT for the tasks we submitted. We present both positive and negative methods, so that future researchers are informed about what works and what doesn’t.

pdf bib
Sina Alinejad at SemEval-2024 Task 7: Numeral Prediction using gpt3.5
Sina Alinejad | Erfan Moosavi Monazzah

pdf bib abs
IUSTNLPLAB at SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes
Mohammad Osoolian | Erfan Moosavi Monazzah | Sauleh Eetemadi

This paper outlines our approach to SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes, specifically addressing subtask 1. The study focuses on model fine-tuning using language models, including BERT, GPT-2, and RoBERTa, with the experiment results demonstrating optimal performance with GPT-2. Our system submission achieved a competitive ranking of 17th out of 33 teams in subtask 1, showcasing the effectiveness of the employed methodology in the context of persuasive technique identification within meme texts.

pdf bib abs
PWEITINLP at SemEval-2024 Task 3: Two Step Emotion Cause Analysis
Sofiia Levchenko | Rafał Wolert | Piotr Andruszkiewicz

ECPE (emotion cause pair extraction) task was introduced to solve the shortcomings of ECE (emotion cause extraction). Models with sequential data processing abilities or complex architecture can be utilized to solve this task. Our contribution to solving Subtask 1: Textual Emotion-Cause Pair Extraction in Conversations defined in the SemEval-2024 Task 3: The Competition of Multimodal Emotion Cause Analysis in Conversations is to create a two-step solution to the ECPE task utilizing GPT-3 for emotion classification and SpanBERT for extracting the cause utterances.

pdf bib abs
IUST-NLPLAB at SemEval-2024 Task 9: BRAINTEASER By MPNet (Sentence Puzzle)
Mohammad Hossein Abbaspour | Erfan Moosavi Monazzah | Sauleh Eetemadi

This study addresses a task encompassing two distinct subtasks: Sentence-puzzle and Word-puzzle. Our primary focus lies within the Sentence-puzzle subtask, which involves discerning the correct answer from a set of three options for a given riddle constructed from sentence fragments. We propose four distinct methodologies tailored to address this subtask effectively. Firstly, we introduce a zero-shot approach leveraging the capabilities of the GPT-3.5 model. Additionally, we present three fine-tuning methodologies utilizing MPNet as the underlying architecture, each employing a different loss function. We conduct comprehensive evaluations of these methodologies on the designated task dataset and meticulously document the obtained results. Furthermore, we conduct an in-depth analysis to ascertain the respective strengths and weaknesses of each method. Through this analysis, we aim to provide valuable insights into the challenges inherent to this task domain.

pdf bib abs
iimasNLP at SemEval-2024 Task 8: Unveiling structure-aware language models for automatic generated text identification
Andric Valdez | Fernando Márquez | Jorge Pantaleón | Helena Gómez | Gemma Bel-enguix

Large language models (LLMs) are artificial intelligence systems that can generate text, translate languages, and answer questions in a human-like way. While these advances are impressive, there is concern that LLMs could also be used to generate fake or misleading content. In this work, as a part of our participation in SemEval-2024 Task-8, we investigate the ability of LLMs to identify whether a given text was written by a human or by a specific AI. We believe that human and machine writing style patterns are different from each other, so integrating features at different language levels can help in this classification task. For this reason, we evaluate several LLMs that aim to extract valuable multilevel information (such as lexical, semantic, and syntactic) from the text in their training processing. Our best scores on Sub- taskA (monolingual) and SubtaskB were 71.5% and 38.2% in accuracy, respectively (both using the ConvBERT LLM); for both subtasks, the baseline (RoBERTa) achieved an accuracy of 74%.

pdf bib abs
INGEOTEC at SemEval-2024 Task 10: Bag of Words Classifiers
Daniela Moctezuma | Eric Tellez | Jose Ortiz Bejar | Mireya Paredes

The Emotion Recognition in Conversation subtask aims to predict the emotions of the utterance of a conversation. In its most basic form, one can treat each utterance separately without considering that it is part of a conversation. Using this simplification, one can use any text classification algorithm to tackle this problem. This contribution follows this approach by solving the problem with different text classifiers based on Bag of Words. Nonetheless, the best approach takes advantage of the dynamics of the conversation; however, this algorithm is not statistically different than a Bag of Words with a Linear Support Vector Machine.

pdf bib abs
IIMAS at SemEval-2024 Task 9: A Comparative Approach for Brainteaser Solutions
Cecilia Reyes | Orlando Ramos-flores | Diego Martínez-maqueda

In this document, we detail our participation experience in SemEval-2024 Task 9: BRAINTEASER-A Novel Task Defying Common Sense. We tackled this challenge by applying fine-tuning techniques with pre-trained models (BERT and RoBERTa Winogrande), while also augmenting the dataset with the LLMs ChatGPT and Gemini. We achieved an accuracy of 0.93 with our best model, along with an F1 score of 0.87 for the Entailment class, 0.94 for the Contradiction class, and 0.96 for the Neutral class

pdf bib abs
PetKaz at SemEval-2024 Task 3: Advancing Emotion Classification with an LLM for Emotion-Cause Pair Extraction in Conversations
Roman Kazakov | Kseniia Petukhova | Ekaterina Kochmar

In this paper, we present our submission to the SemEval-2023 Task 3 “The Competition of Multimodal Emotion Cause Analysis in Conversations”, focusing on extracting emotion-cause pairs from dialogs. Specifically, our approach relies on combining fine-tuned GPT-3.5 for emotion classification and using a BiLSTM-based neural network to detect causes. We score 2nd in the ranking for Subtask 1, demonstrating the effectiveness of our approach through one of the highest weighted-average proportional F1 scores recorded at 0.264.

pdf bib abs
SCaLAR at SemEval-2024 Task 8: Unmasking the machine : Exploring the power of RoBERTa Ensemble for Detecting Machine Generated Text
Anand Kumar | Abhin B | Sidhaarth Murali

SemEval SubtaskB, a shared task that is concerned with the detection of text generated by one out of the 5 different models - davinci, bloomz, chatGPT, cohere and dolly. This is an important task considering the boom of generative models in the current day scenario and their ability to draft mails, formal documents, write and qualify exams and many more which keep evolving every passing day. The purpose of classifying text as generated by which pre-trained model helps in analyzing how each of the training data has affected the ability of the model in performing a certain given task. In the proposed approach, data augmentation was done in order to handle lengthier sentences and also labelling them with the same parent label. Upon the augmented data three RoBERTa models were trained on different segments of data which were then ensembled using a voting classifier based on their R2 score to achieve a higher accuracy than the individual models itself. The proposed model achieved an overall validation accuracy of 97.05% and testing accuracy of 76.25%. and our standing was 18th position on the leaderboard.

pdf bib abs
PetKaz at SemEval-2024 Task 8: Can Linguistics Capture the Specifics of LLM-generated Text?
Kseniia Petukhova | Roman Kazakov | Ekaterina Kochmar

In this paper, we present our submission to the SemEval-2024 Task 8 “Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection”, focusing on the detection of machine-generated texts (MGTs) in English. Specifically, our approach relies on combining embeddings from the RoBERTa-base with diversity features and uses a resampled training set. We score 16th from 139 in the ranking for Subtask A, and our results show that our approach is generalizable across unseen models and domains, achieving an accuracy of 0.91.

Language models, particularly generative models, are susceptible to hallucinations, generating outputs that contradict factual knowledgeor the source text. This study explores methodsfor detecting hallucinations in three SemEval2024 Task 6 tasks: Machine Translation, Definition Modeling, and Paraphrase Generation.We evaluate two methods: semantic similaritybetween the generated text and factual references, and an ensemble of language modelsthat judge each other’s outputs. Our resultsshow that semantic similarity achieves moderate accuracy and correlation scores in trial data,while the ensemble method offers insights intothe complexities of hallucination detection butfalls short of expectations. This work highlights the challenges of hallucination detectionand underscores the need for further researchin this critical area.

pdf bib abs
INGEOTEC at SemEval-2024 Task 1: Bag of Words and Transformers
Daniela Moctezuma | Eric Tellez | Mario Graff

Understanding the meaning of a written message is crucial in solving problems related to Natural Language Processing; the relatedness of two or more messages is a semantic problem tackled with supervised and unsupervised learning. This paper outlines our submissions to the Semantic Textual Relatedness (STR) challenge at SemEval 2024, which is devoted to evaluating the degree of semantic similarity and relatedness between two sentences across multiple languages. We use two main strategies in our submissions. The first approach is based on the Bag-of-Word scheme, while the second one uses pre-trained Transformers for text representation. We found some attractive results, especially in cases where different models adjust better to certain languages over others.

pdf bib abs
OctavianB at SemEval-2024 Task 6: An exploration of humanlike qualities of hallucinated LLM texts
Octavian Brodoceanu

The tested method for detection involves utilizing models, trained for differentiating machine-generated text, in order to distinguish between regular and hallucinated sequences. The hypothesis under investigation is that the patterns learned in pretraining will be transferable to the task at hand. The rationale is as follows: the training data of the model is human-written text, therefore deviations from the training set could be detected in this manner.A second method has been added post competition as a further exploration of the dataset involving using the loss of the generation as determined by a pretrained LLM.

pdf bib abs
FI Group at SemEval-2024 Task 8: A Syntactically Motivated Architecture for Multilingual Machine-Generated Text Detection
Maha Ben-fares | Urchade Zaratiana | Simon Hernandez | Pierre Holat

In this paper, we present the description of our proposed system for Subtask A - multilingual track at SemEval-2024 Task 8, which aims to classify if text has been generated by an AI or Human. Our approach treats binary text classification as token-level prediction, with the final classification being the average of token-level predictions. Through the use of rich representations of pre-trained transformers, our model is trained to selectively aggregate information from across different layers to score individual tokens, given that each layer may contain distinct information. Notably, our model demonstrates competitive performance on the test dataset, achieving an accuracy score of 95.8%. Furthermore, it secures the 2nd position in the multilingual track of Subtask A, with a mere 0.1% behind the leading system.

pdf bib abs
Team Innovative at SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection
Surbhi Sharma | Irfan Mansuri

With the widespread adoption of large language models (LLMs), such as ChatGPT and GPT-4, in various domains, concerns regarding their potential misuse, including spreading misinformation and disrupting education, have escalated. The need to discern between human-generated and machine-generated text has become increasingly crucial. This paper addresses the challenge of automatic text classification with a focus on distinguishing between human-written and machine-generated text. Leveraging the robust capabilities of the RoBERTa model, we propose an approach for text classification, termed as RoBERTa hybrid, which involves fine-tuning the pre-trained Roberta model coupled with additional dense layers and softmax activation for authorship attribution. In this paper, we present an approach that leverages Stylometric features, hybrid features, and the output probabilities of a fine-tuned RoBERTa model. Our method achieves a test accuracy of 73% and a validation accuracy of 89%, demonstrating promising advancements in the field of machine-generated text detection. These results mark significant progress in the domain of machine-generated text detection, as evidenced by our 74th position on the leaderboard for Subtask-A of SemEval-2024 Task 8.

pdf bib abs
EURECOM at SemEval-2024 Task 4: Hierarchical Loss and Model Ensembling in Detecting Persuasion Techniques
Youri Peskine | Raphael Troncy | Paolo Papotti

This paper describes the submission of team EURECOM at SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes. We only tackled the first sub-task, consisting of detecting 20 named persuasion techniques in the textual content of memes. We trained multiple BERT-based models (BERT, RoBERTa, BERT pre-trained on harmful detection) using different losses (Cross Entropy, Binary Cross Entropy, Focal Loss and a custom-made hierarchical loss). The best results were obtained by leveraging the hierarchical nature of the data, by outputting ancestor classes and with a hierarchical loss. Our final submission consist of an ensembling of our top-3 best models for each persuasion techniques. We obtain hierarchical F1 scores of 0.655 (English), 0.345 (Bulgarian), 0.442 (North Macedonian) and 0.178 (Arabic) on the test set.

pdf bib abs
TU Wien at SemEval-2024 Task 6: Unifying Model-Agnostic and Model-Aware Techniques for Hallucination Detection
Varvara Arzt | Mohammad Mahdi Azarbeik | Ilya Lasy | Tilman Kerl | Gábor Recski

This paper discusses challenges in Natural Language Generation (NLG), specifically addressing neural networks producing output that is fluent but incorrect, leading to “hallucinations”. The SHROOM shared task involves Large Language Models in various tasks, and our methodology employs both model-agnostic and model-aware approaches for hallucination detection. The limited availability of labeled training data is addressed through automatic label generation strategies. Model-agnostic methods include word alignment and fine-tuning a BERT-based pretrained model, while model-aware methods leverage separate classifiers trained on LLMs’ internal data (layer activations and attention values). Ensemble methods combine outputs through various techniques such as regression metamodels, voting, and probability fusion. Our best performing systems achieved an accuracy of 80.6% on the model-aware track and 81.7% on the model-agnostic track, ranking 3rd and 8th among all systems, respectively.

pdf bib abs
silp_nlp at SemEval-2024 Task 1: Cross-lingual Knowledge Transfer for Mono-lingual Learning
Sumit Singh | Pankaj Goyal | Uma Tiwary

Our team, silp_nlp, participated in all three tracks of SemEval2024 Task 1: Semantic Textual Relatedness (STR). We created systems for a total of 29 subtasks across all tracks: nine subtasks for track A, 10 subtasks for track B, and ten subtasks for track C. To make the most of our knowledge across all subtasks, we used transformer-based pre-trained models, which are known for their strong cross-lingual transferability. For track A, we trained our model in two stages. In the first stage, we focused on multi-lingual learning from all tracks. In the second stage, we fine-tuned the model for individual tracks. For track B, we used a unigram and bigram representation with suport vector regression (SVR) and eXtreme Gradient Boosting (XGBoost) regression. For track C, we again utilized cross-lingual transferability without the use of targeted subtask data. Our work highlights the fact that knowledge gained from all subtasks can be transferred to an individual subtask if the base language model has strong cross-lingual characteristics. Our system ranked first in the Indonesian subtask of Track B (C7) and in the top three for four other subtasks.

pdf bib abs
LastResort at SemEval-2024 Task 3: Exploring Multimodal Emotion Cause Pair Extraction as Sequence Labelling Task
Suyash Vardhan Mathur | Akshett Jindal | Hardik Mittal | Manish Shrivastava

Conversation is the most natural form of human communication, where each utterance can range over a variety of possible emotions. While significant work has been done towards the detection of emotions in text, relatively little work has been done towards finding the cause of the said emotions, especially in multimodal settings. SemEval 2024 introduces the task of Multimodal Emotion Cause Analysis in Conversations, which aims to extract emotions reflected in individual utterances in a conversation involving multiple modalities (textual, audio, and visual modalities) along with the corresponding utterances that were the cause for the emotion. In this paper, we propose models that tackle this task as an utterance labeling and a sequence labeling problem and perform a comparative study of these models, involving baselines using different encoders, using BiLSTM for adding contextual information of the conversation, and finally adding a CRF layer to try to model the inter-dependencies between adjacent utterances more effectively. In the official leaderboard for the task, our architecture was ranked 8th, achieving an F1-score of 0.1759 on the leaderboard.

pdf bib abs
DaVinci at SemEval-2024 Task 9: Few-shot prompting GPT-3.5 for Unconventional Reasoning
Suyash Vardhan Mathur | Akshett Jindal | Manish Shrivastava

While significant work has been done in the field of NLP on vertical thinking, which involves primarily logical thinking, little work has been done towards lateral thinking, which involves looking at problems from an unconventional perspective defying existing conceptions and notions. Towards this direction, SemEval 2024 introduces the task of BRAINTEASER, which involves two types of questions – Sentence Puzzle and Word Puzzle that defy conventional common-sense reasoning and constraints. In this paper, we tackle both the questions using few-shot prompting on GPT-3.5 and gain insights regarding the difference in the nature of the two types of questions. Our prompting strategy placed us 26th on the leaderboard for the Sentence Puzzle and 15th on the Word Puzzle task.

pdf bib abs
MorphingMinds at SemEval-2024 Task 10: Emotion Recognition in Conversation in Hindi-English Code-Mixed Conversations
Monika Vyas

The research focuses on emotion detection in multilingual conversations, particularly in Romanized Hindi and English, with applications in sentiment analysis and mental health assessments. The study employs Machine learning, deep learning techniques, including Transformer-based models like XLM-RoBERTa, for feature extraction and emotion classification. Various experiments are conducted to evaluate model performance, including fine-tuning, data augmentation, and addressing dataset imbalances. The findings highlight challenges and opportunities in emotion detection across languages and emphasize culturally sensitive approaches. The study contributes to advancing emotion analysis in multilingual contexts and provides practical guidance for developing more accurate emotion detection systems.

pdf bib abs
SemanticCUETSync at SemEval-2024 Task 1: Finetuning Sentence Transformer to Find Semantic Textual Relatedness
Md. Sajjad Hossain | Ashraful Islam Paran | Symom Hossain Shohan | Jawad Hossain | Mohammed Moshiul Hoque

Semantic textual relatedness is crucial to Natural Language Processing (NLP). Methodologies often exhibit superior performance in high-resource languages such as English compared to low-resource ones like Marathi, Telugu, and Spanish. This study leverages various machine learning (ML) approaches, including Support Vector Regression (SVR) and Random Forest, deep learning (DL) techniques such as Siamese Neural Networks, and transformer-based models such as MiniLM-L6-v2, Marathi-sbert, Telugu-sentence-bert-nli, and Roberta-bne-sentiment-analysis-es, to assess semantic relatedness across English, Marathi, Telugu, and Spanish. The developed transformer-based methods notably outperformed other models in determining semantic textual relatedness across these languages, achieving a Spearman correlation coefficient of 0.822 (for English), 0.870 (for Marathi), 0.820 (for Telugu), and 0.677 (for Spanish). These results led to our work attaining rankings of 22th (for English), 11th (for Marathi), 11th (for Telegu) and 14th (for Spanish), respectively.

pdf bib abs
IASBS at SemEval-2024 Task 10: Delving into Emotion Discovery and Reasoning in Code-Mixed Conversations
Mehrzad Tareh | Aydin Mohandesi | Ebrahim Ansari

In this paper, we detail the IASBS team’s approach and findings from participating in SemEval-2024 Task 10, “Emotion Discovery and Reasoning in Hindi-English Code-mixed Conversations (EDiReF).” This task encompasses three critical subtasks: Emotion Recognition in Conversation (ERC), and Emotion Flip Reasoning (EFR) in both Hindi-English code-mixed and English dialogues. Our methodology integrates advanced NLP and machine learning techniques, focusing on the unique challenges of code-mixing, such as linguistic diversity and shifts in emotional context. By implementing a robust framework that includes data preprocessing, and feature engineering using models like GPT-4 and DistilBERT, we extend our analysis beyond mere emotion identification to explore the triggers behind emotion flips. This endeavor not only achieved third place on the leaderboard, demonstrating a high proficiency in emotion and flip detection with an F1-Score of 0.70 but also significantly contributed to the advancement of emotional AI. Our findings offer valuable insights into the complex interplay of emotions in communication, showcasing the potential for enhancing applications across various domains, from social media analytics to healthcare, and underscore the importance of understanding emotional dynamics in code-mixed conversations for future research and practical applications.

pdf bib abs
Deja Vu at SemEval 2024 Task 9: A Comparative Study of Advanced Language Models for Commonsense Reasoning
Trina Chakraborty | Marufur Rahman | Omar Riyad

This research systematically forms an impression of the capabilities of advanced language models in addressing the BRAINTEASER task introduced at SemEval 2024, which is specifically designed to explore the models’ proficiency in lateral commonsense reasoning. The task sets forth an array of Sentence and Word Puzzles, carefully crafted to challenge the models with scenarios requiring unconventional thought processes. Our methodology encompasses a holistic approach, incorporating pre-processing of data, fine-tuning of transformer-based language models, and strategic data augmentation to explore the depth and flexibility of each model’s understanding. The preliminary results of our analysis are encouraging, highlighting significant potential for advancements in the models’ ability to engage in lateral reasoning. Further insights gained from post-competition evaluations suggest scopes for notable enhancements in model performance, emphasizing the continuous evolution of the models in mastering complex reasoning tasks.

pdf bib abs
FtG-CoT at SemEval-2024 Task 9: Solving Sentence Puzzles Using Fine-Tuned Language Models and Zero-Shot CoT Prompting
Micah Zhang | Shafiuddin Rehan Ahmed | James H. Martin

Recent large language models (LLMs) can solve puzzles that require creativity and lateral thinking. To advance this front of research, we tackle SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. We approach this task by introducing a technique that we call Fine-tuned Generated Chain-of-Thought (FtG-CoT). It is a novel few-shot prompting method that combines a fine-tuned BERT classifier encoder with zero-shot chain-of-thought generation and a fine-tuned LLM. The fine-tuned BERT classifier provides a context-rich encoding of each example question and choice list. Zero-shot chain-of-thought generation leverages the benefits of chain-of-thought prompting without requiring manual creation of the reasoning chains. We fine-tune the LLM on the generated chains-of-thought and include a set of generated reasoning chains in the final few-shot LLM prompt to maximize the relevance and correctness of the final generated response. In this paper, we show that FtG-CoT outperforms the zero-shot prompting baseline presented in the task paper and is highly effective at solving challenging sentence puzzles achieving a perfect score on the practice set and a 0.9 score on the evaluation set.

pdf bib abs
LyS at SemEval-2024 Task 3: An Early Prototype for End-to-End Multimodal Emotion Linking as Graph-Based Parsing
Ana Ezquerro | David Vilares

This paper describes our participation in SemEval 2024 Task 3, which focused on Multimodal Emotion Cause Analysis in Conversations. We developed an early prototype for an end-to-end system that uses graph-based methods from dependency parsing to identify causal emotion relations in multi-party conversations. Our model comprises a neural transformer-based encoder for contextualizing multimodal conversation data and a graph-based decoder for generating the adjacency matrix scores of the causal graph. We ranked 7th out of 15 valid and official submissions for Subtask 1, using textual inputs only. We also discuss our participation in Subtask 2 during post-evaluation using multi-modal inputs.

pdf bib abs
NumDecoders at SemEval-2024 Task 7: FlanT5 and GPT enhanced with CoT for Numerical Reasoning
Andres Gonzalez | Md Zobaer Hossain | Jahedul Alam Junaed

In this paper we present a Chain-of-Thought enhanced solution for large language models, including flanT5 and GPT 3.5 Turbo, aimed at solving mathematical problems to fill in blanks from news headlines. Our approach builds on adata augmentation strategy that incorporates additional mathematical reasoning observations into the original dataset sourced from another mathematical corpus. Both automatic and manual annotations are applied to explicitly describe the reasoning steps required for models to reach the target answer. We employ an ensemble majority voting method to generate finalpredictions across our best-performing models. Our analysis reveals that while larger models trained with our enhanced dataset achieve significant gains (91% accuracy, ranking 5th on the NumEval Task 3 leaderboard), smaller models do not experience improvements and may even see a decrease in overall accuracy. We conclude that improving our automatic an-notations via crowdsourcing methods can be a worthwhile endeavor to train larger models than the ones from this study to see the most accurate results.

pdf bib abs
FZI-WIM at SemEval-2024 Task 2: Self-Consistent CoT for Complex NLI in Biomedical Domain
Jin Liu | Steffen Thoma

This paper describes the inference system of FZI-WIM at the SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our system utilizes the chain of thought (CoT) paradigm to tackle this complex reasoning problem and further improve the CoT performance with self-consistency. Instead of greedy decoding, we sample multiple reasoning chains with the same prompt and make thefinal verification with majority voting. The self-consistent CoT system achieves a baseline F1 score of 0.80 (1st), faithfulness score of 0.90 (3rd), and consistency score of 0.73 (12th). We release the code and data publicly.

pdf bib abs
Lisbon Computational Linguists at SemEval-2024 Task 2: Using a Mistral-7B Model and Data Augmentation
Artur Guimarães | Bruno Martins | João Magalhães

ABSTRACT: This paper describes our approach to the SemEval-2024 safe biomedical Natural Language Inference for Clinical Trials (NLI4CT) task, which concerns classifying statements about Clinical Trial Reports (CTRs). We explored the capabilities of Mistral-7B, a generalistic open-source Large Language Model (LLM). We developed a prompt for the NLI4CT task, and fine-tuning a quantized version of the model using a slightly augmented version of the training dataset. The experimental results show that this approach can produce notable results in terms of the macro F1-score, while having limitations in terms of faithfulness and consistency. All the developed code is publicly available on a GitHub repository.

pdf bib abs
GIL-IIMAS UNAM at SemEval-2024 Task 1: SAND: An In Depth Analysis of Semantic Relatedness Using Regression and Similarity Characteristics
Francisco Lopez-ponce | Ángel Cadena | Karla Salas-jimenez | Gemma Bel-enguix | David Preciado-márquez

The STR shared task aims at detecting the degree of semantic relatedness between sentence pairs in multiple languages. Semantic relatedness relies on elements such as topic similarity, point of view agreement, entailment, and even human intuition, making it a broader field than sentence similarity. The GIL-IIMAS UNAM team proposes a model based in the SAND characteristics composition (Sentence Transformers, AnglE Embeddings, N-grams, Sentence Length Difference coefficient) and classical regression algorithms. This model achieves a 0.83 Spearman Correlation score in the English test, and a 0.73 in the Spanish counterpart, finishing just above the SemEval baseline in English, and second place in Spanish.

pdf bib abs
Team UTSA-NLP at SemEval 2024 Task 5: Prompt Ensembling for Argument Reasoning in Civil Procedures with GPT4
Dan Schumacher | Anthony Rios

In this paper, we present our system for the SemEval Task 5, The Legal Argument Reasoning Task in Civil Procedure Challenge. Legal argument reasoning is an essential skill that all law students must master. Moreover, it is important to develop natural language processing solutions that can reason about a question given terse domain-specific contextual information. Our system explores a prompt-based solution using GPT4 to reason over legal arguments. We also evaluate an ensemble of prompting strategies, including chain-of-thought reasoning and in-context learning. Overall, our system results in a Macro F1 of .8095 on the validation dataset and .7315 (5th out of 21 teams) on the final test set. Code for this project is available at https://github.com/danschumac1/CivilPromptReasoningGPT4.

pdf bib abs
BD-NLP at SemEval-2024 Task 2: Investigating Generative and Discriminative Models for Clinical Inference with Knowledge Augmentation
Shantanu Nath | Ahnaf Mozib Samin

Healthcare professionals rely on evidence from clinical trial records (CTRs) to devise treatment plans. However, the increasing quantity of CTRs poses challenges in efficiently assimilating the latest evidence to provide personalized evidence-based care. In this paper, we present our solution to the SemEval- 2024 Task 2 titled “Safe Biomedical Natural Language Inference for Clinical Trials”. Given a statement and one/two CTRs as inputs, the task is to determine whether or not the statement entails or contradicts the CTRs. We explore both generative and discriminative large language models (LLM) to investigate their performance for clinical inference. Moreover, we contrast the general-purpose LLMs with the ones specifically tailored for the clinical domain to study the potential advantage in mitigating distributional shifts. Furthermore, the benefit of augmenting additional knowledge within the prompt/statement is examined in this work. Our empirical study suggests that DeBERTa-lg, a discriminative general-purpose natural language inference model, obtains the highest F1 score of 0.77 on the test set, securing the fourth rank on the leaderboard. Intriguingly, the augmentation of knowledge yields subpar results across most cases.

pdf bib abs
NLP at UC Santa Cruz at SemEval-2024 Task 5: Legal Answer Validation using Few-Shot Multi-Choice QA
Anish Pahilajani | Samyak Jain | Devasha Trivedi

This paper presents our submission to the SemEval 2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure. We present two approaches to solving the task of legal answer validation, given an introduction to the case, a question and an answer candidate. Firstly, we fine-tuned pre-trained BERT-based models and found that models trained on domain knowledge perform better. Secondly, we performed few-shot prompting on GPT models and found that reformulating the answer validation task to be a multiple-choice QA task remarkably improves the performance of the model. Our best submission is a BERT-based model that achieved the 7th place out of 20.

Detecting persuasive communication is an important topic in Natural Language Processing (NLP), as it can be useful in identifying fake information on social media. We have developed a system to identify applied persuasion techniques in text fragments across four languages: English, Bulgarian, North Macedonian, and Arabic. Our system uses data augmentation methods and employs an ensemble strategy that combines the strengths of both RoBERTa and DeBERTa models. Due to limited resources, we concentrated solely on task 1, and our solution achieved the top ranking in the English track during the official assessments. We also analyse the impact of architectural decisions, data constructionand training strategies.

pdf bib abs
HaRMoNEE at SemEval-2024 Task 6: Tuning-based Approaches to Hallucination Recognition
Timothy Obiso | Jingxuan Tu | James Pustejovsky

This paper presents the Hallucination Recognition Model for New Experiment Evaluation (HaRMoNEE) team’s winning (#1) and #10 submissions for SemEval-2024 Task 6: Shared- task on Hallucinations and Related Observable Overgeneration Mistakes (SHROOM)’s two subtasks. This task challenged its participants to design systems to detect hallucinations in Large Language Model (LLM) outputs. Team HaRMoNEE proposes two architectures: (1) fine-tuning an off-the-shelf transformer-based model and (2) prompt tuning large-scale Large Language Models (LLMs). One submission from the fine-tuning approach outperformed all other submissions for the model-aware subtask; one submission from the prompt-tuning approach is the 10th-best submission on the leaderboard for the model-agnostic subtask. Our systems also include pre-processing, system-specific tuning, post-processing, and evaluation.

pdf bib abs
VerbaNexAI Lab at SemEval-2024 Task 10: Emotion recognition and reasoning in mixed-coded conversations based on an NRC VAD approach
Santiago Garcia | Elizabeth Martinez | Juan Cuadrado | Juan Martinez-santos | Edwin Puertas

This study introduces an innovative approach to emotion recognition and reasoning about emotional shifts in code-mixed conversations, leveraging the NRC VAD Lexicon and computational models such as Transformer and GRU. Our methodology systematically identifies and categorizes emotional triggers, employing Emotion Flip Reasoning (EFR) and Emotion Recognition in Conversation (ERC). Through experiments with the MELD and MaSaC datasets, we demonstrate the model’s precision in accurately identifying emotional shift triggers and classifying emotions, evidenced by a significant improvement in accuracy as shown by an increase in the F1 score when including VAD analysis. These results underscore the importance of incorporating complex emotional dimensions into conversation analysis, paving new pathways for understanding emotional dynamics in code-mixed texts.

pdf bib abs
VerbaNexAI Lab at SemEval-2024 Task 3: Deciphering emotional causality in conversations using multimodal analysis approach
Victor Pacheco | Elizabeth Martinez | Juan Cuadrado | Juan Carlos Martinez Santos | Edwin Puertas

This study delineates our participation in the SemEval-2024 Task 3: Multimodal Emotion Cause Analysis in Conversations, focusing on developing and applying an innovative methodology for emotion detection and cause analysis in conversational contexts. Leveraging logistic regression, we analyzed conversational utterances to identify emotions per utterance. Subsequently, we employed a dependency analysis pipeline, utilizing SpaCy to extract significant chunk features, including object, subject, adjectival modifiers, and adverbial clause modifiers. These features were analyzed within a graph-like framework, conceptualizing the dependency relationships as edges connecting emotional causes (tails) to their corresponding emotions (heads). Despite the novelty of our approach, the preliminary results were unexpectedly humbling, with a consistent score of 0.0 across all evaluated metrics. This paper presents our methodology, the challenges encountered, and an analysis of the potential factors contributing to these outcomes, offering insights into the complexities of emotion-cause analysis in multimodal conversational data.

pdf bib abs
VerbaNexAI Lab at SemEval-2024 Task 1: A Multilayer Artificial Intelligence Model for Semantic Relationship Detection
Anderson Morillo | Daniel Peña | Juan Carlos Martinez Santos | Edwin Puertas

This paper presents an artificial intelligence model designed to detect semantic relationships in natural language, addressing the challenges of SemEval 2024 Task 1. Our goal is to advance machine understanding of the subtleties of human language through semantic analysis. Using a novel combination of convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and an attention mechanism, our model is trained on the STR-2022 dataset. This approach enhances its ability to detect semantic nuances in different texts. The model achieved an 81.92% effectiveness rate and ranked 24th in SemEval 2024 Task 1. These results demonstrate its robustness and adaptability in detecting semantic relationships and validate its performance in diverse linguistic contexts. Our work contributes to natural language processing by providing insights into semantic textual relatedness. It sets a benchmark for future research and promises to inspire innovations that could transform digital language processing and interaction.

pdf bib abs
UMBCLU at SemEval-2024 Task 1: Semantic Textual Relatedness with and without machine translation
Shubhashis Roy Dipta | Sai Vallurupalli

The aim of SemEval-2024 Task 1, “Semantic Textual Relatedness for African and Asian Languages” is to develop models for identifying semantic textual relatedness (STR) between two sentences using multiple languages (14 African and Asian languages) and settings (supervised, unsupervised, and cross-lingual). Large language models (LLMs) have shown impressive performance on several natural language understanding tasks such as multilingual machine translation (MMT), semantic similarity (STS), and encoding sentence embeddings. Using a combination of LLMs that perform well on these tasks, we developed two STR models, TranSem and FineSem, for the supervised and cross-lingual settings. We explore the effectiveness of several training methods and the usefulness of machine translation. We find that direct fine-tuning on the task is comparable to using sentence embeddings and translating to English leads to better performance for some languages. In the supervised setting, our model performance is better than the official baseline for 3 languages with the remaining 4 performing on par. In the cross-lingual setting, our model performance is better than the baseline for 3 languages (leading to 1st place for Africaans and 2nd place for Indonesian), is on par for 2 languages and performs poorly on the remaining 7 languages.

Our paper presents team MasonTigers submission to the SemEval-2024 Task 9 - which provides a dataset of puzzles for testing natural language understanding. We employ large language models (LLMs) to solve this task through several prompting techniques. Zero-shot and few-shot prompting generate reasonably good results when tested with proprietary LLMs, compared to the open-source models. We obtain further improved results with chain-of-thought prompting, an iterative prompting method that breaks down the reasoning process step-by-step. We obtain our best results by utilizing an ensemble of chain-of-thought prompts, placing 2nd in the word puzzle subtask and 13th in the sentence puzzle subtask. The strong performance of prompted LLMs demonstrates their capability for complex reasoning when provided with a decomposition of the thought process. Our work sheds light on how step-wise explanatory prompts can unlock more of the knowledge encoded in the parameters of large models.

This paper presents the MasonTigers entryto the SemEval-2024 Task 8 - Multigenerator, Multidomain, and Multilingual BlackBox Machine-Generated Text Detection. Thetask encompasses Binary Human-Written vs.Machine-Generated Text Classification (TrackA), Multi-Way Machine-Generated Text Classification (Track B), and Human-Machine MixedText Detection (Track C). Our best performing approaches utilize mainly the ensemble ofdiscriminator transformer models along withsentence transformer and statistical machinelearning approaches in specific cases. Moreover, Zero shot prompting and fine-tuning ofFLAN-T5 are used for Track A and B.

pdf bib abs
UIC NLP GRADS at SemEval-2024 Task 3: Two-Step Disjoint Modeling for Emotion-Cause Pair Extraction
Sharad Chandakacherla | Vaibhav Bhargava | Natalie Parde

Disentangling underlying factors contributing to the expression of emotion in multimodal data is challenging but may accelerate progress toward many real-world applications. In this paper we describe our approach for solving SemEval-2024 Task #3, Sub-Task #1, focused on identifying utterance-level emotions and their causes using the text available from the multimodal F.R.I.E.N.D.S. television series dataset. We propose to disjointly model emotion detection and causal span detection, borrowing a paradigm popular in question answering (QA) to train our model. Through our experiments we find that (a) contextual utterances before and after the target utterance play a crucial role in emotion classification; and (b) once the emotion is established, detecting the causal spans resulting in that emotion using our QA-based technique yields promising results.

This paper presents the MasonTigers’ entry to the SemEval-2024 Task 1 - Semantic Textual Relatedness. The task encompasses supervised (Track A), unsupervised (Track B), and cross-lingual (Track C) approaches to semantic textual relatedness across 14 languages. MasonTigers stands out as one of the two teams who participated in all languages across the three tracks. Our approaches achieved rankings ranging from 11th to 21st in Track A, from 1st to 8th in Track B, and from 5th to 12th in Track C. Adhering to the task-specific constraints, our best performing approaches utilize an ensemble of statistical machine learning approaches combined with language-specific BERT based models and sentence transformers.

pdf bib abs
RiddleMasters at SemEval-2024 Task 9: Comparing Instruction Fine-tuning with Zero-Shot Approaches
Kejsi Take | Chau Tran

This paper describes our contribution to SemEval 2023 Task 8: Brainteaser. We compared multiple zero-shot approaches using GPT-4, the state of the art model with Mistral-7B, a much smaller open-source LLM. While GPT-4 remains a clear winner in all the zero-shot approaches, we show that finetuning Mistral-7B can achieve comparable, even though marginally lower results.

pdf bib abs
IITK at SemEval-2024 Task 2: Exploring the Capabilities of LLMs for Safe Biomedical Natural Language Inference for Clinical Trials
Shreyasi Mandal | Ashutosh Modi

Large Language models (LLMs) have demonstrated state-of-the-art performance in various natural language processing (NLP) tasks across multiple domains, yet they are prone to shortcut learning and factual inconsistencies. This research investigates LLMs’ robustness, consistency, and faithful reasoning when performing Natural Language Inference (NLI) on breast cancer Clinical Trial Reports (CTRs) in the context of SemEval 2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. We examine the reasoning capabilities of LLMs and their adeptness at logical problem-solving. A comparative analysis is conducted on pre-trained language models (PLMs), GPT-3.5, and Gemini Pro under zero-shot settings using Retrieval-Augmented Generation (RAG) framework, integrating various reasoning chains. The evaluation yields an F1 score of 0.69, consistency of 0.71, and a faithfulness score of 0.90 on the test dataset.

pdf bib abs
PEAR at SemEval-2024 Task 1: Pair Encoding with Augmented Re-sampling for Semantic Textual Relatedness
Tollef Jørgensen

This paper describes a system submitted to the supervised track (Track A) at SemEval-24: Semantic Textual Relatedness for African and Asian Languages. Challenged with datasets of varying sizes, some as small as 800 samples, we observe that the PEAR system, using smaller pre-trained masked language models to process sentence pairs (Pair Encoding), results in models that efficiently adapt to the task.In addition to the simplistic modeling approach, we experiment with hyperparameter optimization and data expansion from the provided training sets using multilingual bi-encoders, sampling a dynamic number of nearest neighbors (Augmented Re-sampling). The final models are lightweight, allowing fast experimentation and integration of new languages.

pdf bib abs
BCAmirs at SemEval-2024 Task 4: Beyond Words: A Multimodal and Multilingual Exploration of Persuasion in Memes
Amirhossein Abaskohi | Amirhossein Dabiriaghdam | Lele Wang | Giuseppe Carenini

Memes, combining text and images, frequently use metaphors to convey persuasive messages, shaping public opinion. Motivated by this, our team engaged in SemEval-2024 Task 4, a hierarchical multi-label classification task designed to identify rhetorical and psychological persuasion techniques embedded within memes. To tackle this problem, we introduced a caption generation step to assess the modality gap and the impact of additional semantic information from images, which improved our result. Our best model utilizes GPT-4 generated captions alongside meme text to fine-tune RoBERTa as the text encoder and CLIP as the image encoder. It outperforms the baseline by a large margin in all 12 subtasks. In particular, it ranked in top-3 across all languages in Subtask 2a, and top-4 in Subtask 2b, demonstrating quantitatively strong performance. The improvement achieved by the introduced intermediate step is likely attributable to the metaphorical essence of images that challenges visual encoders. This highlights the potential for improving abstract visual semantics encoding.

pdf bib abs
Pauk at SemEval-2024 Task 4: A Neuro-Symbolic Method for Consistent Classification of Propaganda Techniques in Memes
Matt Pauk | Maria Leonor Pacheco

Memes play a key role in most modern informa-tion campaigns, particularly propaganda cam-paigns. Identifying the persuasive techniquespresent in memes is an important step in de-veloping systems to recognize and curtail pro-paganda. This work presents a framework toidentify the persuasive techniques present inmemes for the SemEval 2024 Task 4, accordingto a hierarchical taxonomy of propaganda tech-niques. The framework involves a knowledgedistillation method, where the base model is acombination of DeBERTa and ResNET usedto classify the text and image, and the teachermodel consists of a group of weakly enforcedlogic rules that promote the hierarchy of per-suasion techniques. The addition of the logicrule layer for knowledge distillation shows im-provement in respecting the hierarchy of thetaxonomy with a slight boost in performance.

pdf bib abs
Saama Technologies at SemEval-2024 Task 2: Three-module System for NLI4CT Enhanced by LLM-generated Intermediate Labels
Hwanmun Kim | Kamal Raj Kanakarajan | Malaikannan Sankarasubbu

Participating in SemEval 2024 Task 2, we built a three-module system to predict entailment labels for NLI4CT, which consists of a sequence of the query generation module, the query answering module, and the aggregation module. We fine-tuned or prompted each module with the intermediate labels we generated with LLMs, and we optimized the combinations of different modules through experiments. Our system is ranked 19th ~ 24th in the SemEval 2024 Task 2 leaderboard in different metrics. We made several interesting observations regarding the correlation between different metrics and the sensitivity of our system on the aggregation module. We performed the error analysis on our system which can potentially help to improve our system further.

pdf bib abs
AmazUtah_NLP at SemEval-2024 Task 9: A MultiChoice Question Answering System for Commonsense Defying Reasoning
Soumya Mishra | Mina Ghashami

The SemEval 2024 BRAINTEASER task represents a pioneering venture in Natural Language Processing (NLP) by focusing on lateral thinking, a dimension of cognitive reasoning that is often overlooked in traditional linguistic analyses. This challenge comprises of Sentence Puzzle and Word Puzzle sub-tasks and aims to test language models’ capacity for divergent thinking. In this paper, we present our approach to the BRAINTEASER task. We employ a holistic strategy by leveraging cutting-edge pre-trained models in multiple choice architecture, and diversify the training data with Sentence and Word Puzzle datasets. To gain further improvement, we fine-tuned the model with synthetic humor/jokes dataset and the RiddleSense dataset which helped augmenting the model’s lateral thinking abilities. Empirical results show that our approach achieve 92.5% accuracy in Sentence Puzzle subtask and 80.2% accuracy in Word Puzzle subtask.

pdf bib abs
IITK at SemEval-2024 Task 1: Contrastive Learning and Autoencoders for Semantic Textual Relatedness in Multilingual Texts
Udvas Basak | Rajarshi Dutta | Shivam Pandey | Ashutosh Modi

This paper describes our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness. The challenge is focused on automatically detecting the degree of relatedness between pairs of sentences for 14 languages including both high and low-resource Asian and African languages. Our team participated in two subtasks consisting of Track A: supervised and Track B: unsupervised. This paper focuses on a BERT-based contrastive learning and similarity metric based approach primarily for the supervised track while exploring autoencoders for the unsupervised track. It also aims on the creation of a bigram relatedness corpus using negative sampling strategy, thereby producing refined word embeddings.

pdf bib abs
Compos Mentis at SemEval2024 Task6: A Multi-Faceted Role-based Large Language Model Ensemble to Detect Hallucination
Souvik Das | Rohini Srihari

Hallucinations in large language models (LLMs), where they generate fluent but factually incorrect outputs, pose challenges for applications requiring strict truthfulness. This work proposes a multi-faceted approach to detect such hallucinations across various language tasks. We leverage automatic data annotation using a proprietary LLM, fine-tuning of the Mistral-7B-instruct-v0.2 model on annotated and benchmark data, role-based and rationale-based prompting strategies, and an ensemble method combining different model outputs through majority voting. This comprehensive framework aims to improve the robustness and reliability of hallucination detection for LLM generations.

pdf bib abs
NYCU-NLP at SemEval-2024 Task 2: Aggregating Large Language Models in Biomedical Natural Language Inference for Clinical Trials
Lung-hao Lee | Chen-ya Chiou | Tzu-mi Lin

This study describes the model design of the NYCU-NLP system for the SemEval-2024 Task 2 that focuses on natural language inference for clinical trials. We aggregate several large language models to determine the inference relation (i.e., entailment or contradiction) between clinical trial reports and statements that may be manipulated with designed interventions to investigate the faithfulness and consistency of the developed models. First, we use ChatGPT v3.5 to augment original statements in training data and then fine-tune the SOLAR model with all augmented data. During the testing inference phase, we fine-tune the OpenChat model to reduce the influence of interventions and fed a cleaned statement into the fine-tuned SOLAR model for label prediction. Our submission produced a faithfulness score of 0.9236, ranking second of 32 participating teams, and ranked first for consistency with a score of 0.8092.

This paper explores solutions to the challenges posed by the widespread use of LLMs, particularly in the context of identifying human-written versus machine-generated text. Focusing on Subtask B of SemEval 2024 Task 8, we compare the performance of RoBERTa and DeBERTa models. Subtask B involved identifying not only human or machine text but also the specific LLM responsible for generating text, where our DeBERTa model outperformed the RoBERTa baseline by over 10% in leaderboard accuracy. The results highlight the rapidly growing capabilities of LLMs and importance of keeping up with the latest advancements. Additionally, our paper presents visualizations using PCA and t-SNE that showcase the DeBERTa model’s ability to cluster different LLM outputs effectively. These findings contribute to understanding and improving AI methods for detecting machine-generated text, allowing us to build more robust and traceable AI systems in the language ecosystem.

pdf bib abs
Calc-CMU at SemEval-2024 Task 7: Pre-Calc - Learning to Use the Calculator Improves Numeracy in Language Models
Vishruth Veerendranath | Vishwa Shah | Kshitish Ghate

Quantitative and numerical comprehension in language is an important task in many fields like education and finance, but still remains a challenging task for language models. While tool and calculator usage has shown to be helpful to improve mathematical reasoning in large pretrained decoder-only language models, this remains unexplored for smaller language models with encoders. In this paper, we propose Pre-Calc, a simple pre-finetuning objective of learning to use the calculator for both encoder-only and encoder-decoder architectures, formulated as a discriminative and generative task respectively. We pre-train BERT and RoBERTa for discriminative calculator use and Flan-T5 for generative calculator use on the MAWPS, SVAMP, and AsDiv-A datasets, which improves performance on downstream tasks that require numerical understanding. Our code and data are available at https://github.com/calc-cmu/pre-calc.

pdf bib abs
AISPACE at SemEval-2024 task 8: A Class-balanced Soft-voting System for Detecting Multi-generator Machine-generated Text
Renhua Gu | Xiangfeng Meng

SemEval-2024 Task 8 provides a challenge to detect human-written and machine-generated text. There are 3 subtasks for different detection scenarios. This paper proposes a system that mainly deals with Subtask B. It aims to detect if given full text is written by human or is generated by a specific Large Language Model (LLM), which is actually a multi-class text classification task. Our team AISPACE conducted a systematic study of fine-tuning transformer-based models, including encoder-only, decoder-only and encoder-decoder models. We compared their performance on this task and identified that encoder-only models performed exceptionally well. We also applied a weighted Cross Entropy loss function to address the issue of data imbalance of different class samples. Additionally, we employed soft-voting strategy over multi-models ensemble to enhance the reliability of our predictions. Our system ranked top 1 in Subtask B, which sets a state-of-the-art benchmark for this new challenge.

pdf bib abs
SemEval-2024 Task 7: Numeral-Aware Language Understanding and Generation
Chung-chi Chen | Jian-tao Huang | Hen-hsen Huang | Hiroya Takamura | Hsin-hsi Chen

Numbers are frequently utilized in both our daily narratives and professional documents, such as clinical notes, scientific papers, financial documents, and legal court orders. The ability to understand and generate numbers is thus one of the essential aspects of evaluating large language models. In this vein, we propose a collection of datasets in SemEval-2024 Task 7 - NumEval. This collection encompasses several tasks focused on numeral-aware instances, including number prediction, natural language inference, question answering, reading comprehension, reasoning, and headline generation. This paper offers an overview of the dataset and presents the results of all subtasks in NumEval. Additionally, we contribute by summarizing participants’ methods and conducting an error analysis. To the best of our knowledge, NumEval represents one of the early tasks that perform peer evaluation in SemEval’s history. We will further share observations from this aspect and provide suggestions for future SemEval tasks.

pdf bib abs
UCSC NLP at SemEval-2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)
Neng Wan | Steven Au | Esha Ubale | Decker Krogh

We describe SemEval-2024 Task 10: EDiReF consisting of three sub-tasks involving emotion in conversation across Hinglish code-mixed and English datasets. Subtasks include classification of speaker emotion in multiparty conversations (Emotion Recognition in Conversation) and reasoning around shifts in speaker emotion state (Emotion Flip Reasoning). We deployed a BERT model for emotion recognition and two GRU-based models for emotion flip. Our model achieved F1 scores of 0.45, 0.79, and 0.68 for subtasks 1, 2, and 3, respectively.

pdf bib abs
CLULab-UofA at SemEval-2024 Task 8: Detecting Machine-Generated Text Using Triplet-Loss-Trained Text Similarity and Text Classification
Mohammadhossein Rezaei | Yeaeun Kwon | Reza Sanayei | Abhyuday Singh | Steven Bethard

Detecting machine-generated text is a critical task in the era of large language models. In this paper, we present our systems for SemEval-2024 Task 8, which focuses on multi-class classification to discern between human-written and maching-generated texts by five state-of-the-art large language models. We propose three different systems: unsupervised text similarity, triplet-loss-trained text similarity, and text classification. We show that the triplet-loss trained text similarity system outperforms the other systems, achieving 80% accuracy on the test set and surpassing the baseline model for this subtask. Additionally, our text classification system, which takes into account sentence paraphrases generated by the candidate models, also outperforms the unsupervised text similarity system, achieving 74% accuracy.

pdf bib abs
SINAI at SemEval-2024 Task 8: Fine-tuning on Words and Perplexity as Features for Detecting Machine Written Text
Alberto Gutiérrez Megías | L. Alfonso Ureña-lópez | Eugenio Martínez Cámara

This work presents the proposed systems of the SINAI team for the subtask A of the Task 8 in SemEval 2024. We present the evaluation of two disparate systems, and our final submitted system. We claim that the perplexity value of a text may be used as classification signal. Accordingly, we conduct a study on the utility of perplexity for discerning text authorship, and we perform a comparative analysis of the results obtained on the datasets of the task. This comparative evaluation includes results derived from the systems evaluated, such as fine-tuning using an XLM-RoBERTa-Large transformer or using perplexity as a classification criterion. In addition, we discuss the results reached on the test set, where we show that there is large differences among the language probability distribution of the training and test sets. These analysis allows us to open new research lines to improve the detection of machine-generated text.

This paper introduces the system developed by USTC-BUPT for SemEval-2024 Task 8. The shared task comprises three subtasks across four tracks, aiming to develop automatic systems to distinguish between human-written and machine-generated text across various domains, languages and generators. Our system comprises four components: DATeD, LLAM, TLE, and AuDM, which empower us to effectively tackle all subtasks posed by the challenge. In the monolingual track, DATeD improves machine-generated text detection by incorporating a gradient reversal layer and integrating additional domain labels through Domain Adversarial Neural Networks, enhancing adaptation to diverse text domains. In the multilingual track, LLAM employs different strategies based on language characteristics. For English text, the LLM Embeddings approach utilizes embeddings from a proxy LLM followed by a two-stage CNN for classification, leveraging the broad linguistic knowledge captured during pre-training to enhance performance. For text in other languages, the LLM Sentinel approach transforms the classification task into a next-token prediction task, which facilitates easier adaptation to texts in various languages, especially low-resource languages. TLE utilizes the LLM Embeddings method with a minor modification in the classification strategy for subtask B. AuDM employs data augmentation and fine-tunes the DeBERTa model specifically for subtask C. Our system wins the multilingual track and ranks second in the monolingual track. Additionally, it achieves third place in both subtask B and C.

pdf bib abs
ALF at SemEval-2024 Task 9: Exploring Lateral Thinking Capabilities of LMs through Multi-task Fine-tuning
Seyed Ali Farokh | Hossein Zeinali

Recent advancements in natural language processing (NLP) have prompted the development of sophisticated reasoning benchmarks. This paper presents our system for the SemEval 2024 Task 9 competition and also investigates the efficacy of fine-tuning language models (LMs) on BrainTeaser—a benchmark designed to evaluate NLP models’ lateral thinking and creative reasoning abilities. Our experiments focus on two prominent families of pre-trained models, BERT and T5. Additionally, we explore the potential benefits of multi-task fine-tuning on commonsense reasoning datasets to enhance performance. Our top-performing model, DeBERTa-v3-large, achieves an impressive overall accuracy of 93.33%, surpassing human performance.

pdf bib abs
Pollice Verso at SemEval-2024 Task 6: The Roman Empire Strikes Back
Konstantin Kobs | Jan Pfister | Andreas Hotho

We present an intuitive approach for hallucination detection in LLM outputs that is modeled after how humans would go about this task. We engage several LLM “experts” to independently assess whether a response is hallucinated. For this we select recent and popular LLMs smaller than 7B parameters. By analyzing the log probabilities for tokens that signal a positive or negative judgment, we can determine the likelihood of hallucination. Additionally, we enhance the performance of our “experts” by automatically refining their prompts using the recently introduced OPRO framework. Furthermore, we ensemble the replies of the different experts in a uniform or weighted manner, which builds a quorum from the expert replies. Overall this leads to accuracy improvements of up to 10.6 p.p. compared to the challenge baseline. We show that a Zephyr 3B model is well suited for the task. Our approach can be applied in the model-agnostic and model-aware subtasks without modification and is flexible and easily extendable to related tasks.

pdf bib abs
whatdoyoumeme at SemEval-2024 Task 4: Hierarchical-Label-Aware Persuasion Detection using Translated Texts
Nishan Chatterjee | Marko Pranjic | Boshko Koloski | Lidia Pivovarova | Senja Pollak

In this paper, we detail the methodology of team whatdoyoumeme for the SemEval 2024 Task on Multilingual Persuasion Detection in Memes. We integrate hierarchical label information to refine detection capabilities, and employ a cross-lingual approach, utilizing translation to adapt the model to Macedonian, Arabic, and Bulgarian. Our methodology encompasses both the analysis of meme content and extending labels to include hierarchical structure. The effectiveness of the approach is demonstrated through improved model performance in multilingual contexts, highlighting the utility of translation-based methods and hierarchy-aware learning, over traditional baselines.

pdf bib abs
LomonosovMSU at SemEval-2024 Task 4: Comparing LLMs and embedder models to identifying propaganda techniques in the content of memes in English for subtasks No1, No2a, and No2b
Gleb Skiba | Mikhail Pukemo | Dmitry Melikhov | Konstantin Vorontsov

This paper presents the solution of the LomonosovMSU team for the SemEval-2024 Task 4 “Multilingual Detection of Persuasion Techniques in Memes” competition for the English language task. During the task solving process, generative and BERT-like (training classifiers on top of embedder models) approaches were tested for subtask No1, as well as an BERT-like approach on top of multimodal embedder models for subtasks No2a/No2b. The models were trained using datasets provided by the competition organizers, enriched with filtered datasets from previous SemEval competitions. The following results were achieved: 18th place for subtask No1, 9th place for subtask No2a, and 11th place for subtask No2b.

pdf bib abs
AILS-NTUA at SemEval-2024 Task 6: Efficient model tuning for hallucination detection and analysis
Natalia Grigoriadou | Maria Lymperaiou | George Filandrianos | Giorgos Stamou

In this paper, we present our team’s submissions for SemEval-2024 Task-6 - SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. The participants were asked to perform binary classification to identify cases of fluent overgeneration hallucinations. Our experimentation included fine-tuning a pre-trained model on hallucination detection and a Natural Language Inference (NLI) model. The most successful strategy involved creating an ensemble of these models, resulting in accuracy rates of 77.8% and 79.9% on model-agnostic and model-aware datasets respectively, outperforming the organizers’ baseline and achieving notable results when contrasted with the top-performing results in the competition, which reported accuracies of 84.7% and 81.3% correspondingly.

pdf bib abs
JMI at SemEval 2024 Task 3: Two-step approach for multimodal ECAC using in-context learning with GPT and instruction-tuned Llama models
Arefa . | Mohammed Abbas Ansari | Chandni Saxena | Tanvir Ahmad

This paper presents our system development for SemEval-2024 Task 3: “The Competition of Multimodal Emotion Cause Analysis in Conversations”. Effectively capturing emotions in human conversations requires integrating multiple modalities such as text, audio, and video. However, the complexities of these diverse modalities pose challenges for developing an efficient multimodal emotion cause analysis (ECA) system. Our proposed approach addresses these challenges by a two-step framework. We adopt two different approaches in our implementation. In Approach 1, we employ instruction-tuning with two separate Llama 2 models for emotion and cause prediction. In Approach 2, we use GPT-4V for conversation-level video description and employ in-context learning with annotated conversation using GPT 3.5. Our system wins rank 4, and system ablation experiments demonstrate that our proposed solutions achieve significant performance gains.

In this paper, we describe our submission for the NLI4CT 2024 shared task on robust Natural Language Inference over clinical trial reports. Our system is an ensemble of nine diverse models which we aggregate via majority voting. The models use a large spectrum of different approaches ranging from a straightforward Convolutional Neural Network over fine-tuned Large Language Models to few-shot-prompted language models using chain-of-thought reasoning.Surprisingly, we find that some individual ensemble members are not only more accurate than the final ensemble model but also more robust.

pdf bib abs
MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity
Reza Sanayei | Abhyuday Singh | Mohammadhossein Rezaei | Steven Bethard

The advent of large language models (LLMs) has revolutionized Natural Language Generation (NLG), offering unmatched text generation capabilities. However, this progress introduces significant challenges, notably hallucinations—semantically incorrect yet fluent outputs. This phenomenon undermines content reliability, as traditional detection systems focus more on fluency than accuracy, posing a risk of misinformation spread.Our study addresses these issues by proposing a unified strategy for detecting hallucinations in neural model-generated text, focusing on the SHROOM task in SemEval 2024. We employ diverse methodologies to identify output divergence from the source content. We utilized Sentence Transformers to measure cosine similarity between source-hypothesis and source-target embeddings, experimented with omitting source content in the cosine similarity computations, and Leveragied LLMs’ In-Context Learning with detailed task prompts as our methodologies. The varying performance of our different approaches across the subtasks underscores the complexity of Natural Language Understanding tasks, highlighting the importance of addressing the nuances of semantic correctness in the era of advanced language models.

This paper describes the architecture of our system developed for participation in Task 3 of SemEval-2024: Multimodal Emotion-Cause Analysis in Conversations. Our project targets the challenges of subtask 2, dedicated to Multimodal Emotion-Cause Pair Extraction with Emotion Category (MECPE-Cat), and constructs a dual-component system tailored to the unique challenges of this task. We divide the task into two subtasks: emotion recognition in conversation (ERC) and emotion-cause pair extraction (ECPE). To address these subtasks, we capitalize on the abilities of Large Language Models (LLMs), which have consistently demonstrated state-of-the-art performance across various natural language processing tasks and domains. Most importantly, we design an approach of emotion-cause-aware instruction-tuning for LLMs, to enhance the perception of the emotions with their corresponding causal rationales. Our method enables us to adeptly navigate the complexities of MECPE-Cat, achieving an average 34.71% F1 score of the task, and securing the 2nd rank on the leaderboard. The code and metadata to reproduce our experiments are all made publicly available.

pdf bib abs
TueCICL at SemEval-2024 Task 8: Resource-efficient approaches for machine-generated text detection
Daniel Stuhlinger | Aron Winkler

Recent developments in the field of NLP have brought large language models (LLMs) to the forefront of both public and research attention. As the use of language generation technologies becomes more widespread, the problem arises of determining whether a given text is machine generated or not. Task 8 at SemEval 2024 consists of a shared task with this exact objective. Our approach aims at developing models and strategies that strike a good balance between performance and model size. We show that it is possible to compete with large transformer-based solutions with smaller systems.

pdf bib abs
GeminiPro at SemEval-2024 Task 9: BrainTeaser on Gemini
Kyu Hyun Choi | Seung-hoon Na

It is known that human thought can be distinguished into lateral and vertical thinking. The development of language models has thus far been focused on evaluating and advancing vertical thinking, while lateral thinking has been somewhat neglected. To foster progress in this area, SemEval has created and distributed a brainteaser dataset based on lateral thinking consist of sentence puzzles and word puzzle QA. In this paper, we test and discuss the performance of the currently known best model, Gemini, on this dataset.

pdf bib abs
Archimedes-AUEB at SemEval-2024 Task 5: LLM explains Civil Procedure
Odysseas Chlapanis | Ion Androutsopoulos | Dimitrios Galanis

The SemEval task on Argument Reasoning in Civil Procedure is challenging in that it requires understanding legal concepts and inferring complex arguments. Currently, most Large Language Models (LLM) excelling in the legal realm are principally purposed for classification tasks, hence their reasoning rationale is subject to contention. The approach we advocate involves using a powerful teacher-LLM (ChatGPT) to extend the training dataset with explanations and generate synthetic data. The resulting data are then leveraged to fine-tune a small student-LLM. Contrary to previous work, our explanations are not directly derived from the teacher’s internal knowledge. Instead they are grounded in authentic human analyses, therefore delivering a superior reasoning signal. Additionally, a new ‘mutation’ method generates artificial data instances inspired from existing ones. We are publicly releasing the explanations as an extension to the original dataset, along with the synthetic dataset and the prompts that were used to generate both. Our system ranked 15th in the SemEval competition. It outperforms its own teacher and can produce explanations aligned with the original human analyses, as verified by legal experts.

pdf bib abs
Weighted Layer Averaging RoBERTa for Black-Box Machine-Generated Text Detection
Ayan Datta | Aryan Chandramania | Radhika Mamidi

We propose a novel approach for machine-generated text detection using a RoBERTa model with weighted layer averaging and AdaLoRA for parameter-efficient fine-tuning. Our method incorporates information from all model layers, capturing diverse linguistic cues beyond those accessible from the final layer alone. To mitigate potential overfitting and improve generalizability, we leverage AdaLoRA, which injects trainable low-rank matrices into each Transformer layer, significantly reducing the number of trainable parameters. Furthermore, we employ data mixing to ensure our model encounters text from various domains and generators during training, enhancing its ability to generalize to unseen data. This work highlights the potential of combining layer-wise information with parameter-efficient fine-tuning and data mixing for effective machine-generated text detection.

pdf bib abs
Mast Kalandar at SemEval-2024 Task 8: On the Trail of Textual Origins: RoBERTa-BiLSTM Approach to Detect AI-Generated Text
Jainit Bafna | Hardik Mittal | Suyash Sethia | Manish Shrivastava | Radhika Mamidi

Large Language Models (LLMs) have showcased impressive abilities in generating fluent responses to diverse user queries. However, concerns regarding the potential misuse ofsuch texts in journalism, educational, and academic contexts have surfaced. SemEval 2024introduces the task of Multigenerator, Multidomain, and Multilingual Black-Box MachineGenerated Text Detection, aiming to developautomated systems for identifying machinegenerated text and detecting potential misuse. In this paper, we i) propose a RoBERTaBiLSTM based classifier designed to classifytext into two categories: AI-generated or human ii) conduct a comparative study of ourmodel with baseline approaches to evaluate itseffectiveness. This paper contributes to the advancement of automatic text detection systemsin addressing the challenges posed by machinegenerated text misuse. Our architecture ranked46th on the official leaderboard with an accuracy of 80.83 among 125.

The degree of semantic relatedness of two units of language has long been considered fundamental to understanding meaning. In this paper, we present the system of Huawei Translation Services Center (HW-TSC) for Task 1 of SemEval 2024, which aims to automatically measure the semantic relatedness of sentence pairs in African and Asian languages. The task dataset for this task covers about 14 different languages, These languages originate from five distinct language families and are predominantly spoken in Africa and Asia. For this shared task, we describe our proposed solutions, including ideas and the implementation steps of the task, as well as the outcomes of each experiment on the development dataset. To enhance the performance, we leverage these experimental outcomes and construct an ensemble one. Our results demonstrate that our system achieves impressive performance on test datasets in unsupervised track B and ranked first place for the Punjabi language pair.

Lateral thinking is essential in breaking away from conventional thought patterns and finding innovative solutions to problems. Despite this, language models often struggle with reasoning tasks that require lateral thinking. In this paper, we present our system for SemEval-2024 Task 9’s BrainTeaser challenge, which requires language models to answer brain teaser questions that typically involve lateral reasoning scenarios. Our framework is based on large language models and incorporates a zero-shot prompting method that integrates conceptualizations of automatically detected instances in the question. We also transform the task of question answering into a declarative format to enhance the discriminatory ability of large language models. Our zero-shot evaluation results with ChatGPT indicate that our approach outperforms baselines, including zero-shot and few-shot prompting and chain-of-thought reasoning. Additionally, our system ranks ninth on the official leaderboard, demonstrating its strong performance.

Large Language Models (LLMs) have demonstrated impressive performance on many Natural Language Processing (NLP) tasks. However, their ability to solve more creative, lateral thinking puzzles remains relatively unexplored. In this work, we develop methods to enhance the lateral thinking and puzzle-solving capabilities of LLMs. We curate a dataset of word-type and sentence-type brain teasers requiring creative problem-solving abilities beyond commonsense reasoning. We first evaluate the zero-shot performance of models like GPT-3.5 and GPT-4 on this dataset. To improve their puzzle-solving skills, we employ prompting techniques like providing reasoning clues and chaining multiple examples to demonstrate the desired thinking process. We also fine-tune the state-of-the-art Mixtral 7x8b LLM on ourdataset. Our methods enable the models to achieve strong results, securing 2nd and 3rd places in the brain teaser task. Our work highlights the potential of LLMs in acquiring complex reasoning abilities with the appropriate training. The efficacy of our approaches opens up new research avenues into advancing lateral thinking and creative problem-solving with AI systems.

pdf bib abs
SU-FMI at SemEval-2024 Task 5: From BERT Fine-Tuning to LLM Prompt Engineering - Approaches in Legal Argument Reasoning
Kristiyan Krumov | Svetla Boytcheva | Ivan Koytchev

This paper presents our approach and findings for SemEval-2024 Task 5, focusing on legal argument reasoning. We explored the effectiveness of fine-tuning pre-trained BERT models and the innovative application of large language models (LLMs) through prompt engineering in the context of legal texts. Our methodology involved a combination of techniques to address the challenges posed by legal language processing, including handling long texts and optimizing natural language understanding (NLU) capabilities for the legal domain. Our contributions were validated by achieving a third-place ranking on the SemEval 2024 Task 5 Leaderboard. The results underscore the potential of LLMs and prompt engineering in enhancing legal reasoning tasks, offering insights into the evolving landscape of NLU technologies within the legal field.

pdf bib abs
Challenges at SemEval 2024 Task 7: Contrastive Learning Approach on Numeral-Aware Language Generation
Ali Zhunis | Hao-yun Chuang

Although Large Language Model (LLM) excels on generating headline on ROUGE evaluation, it still fails to reason number and generate news article headline with accurate number. Attending SemEval-2024 Task 7 subtask 3, our team aims on using contrastive loss to increase the understanding of the number from their different expression, and knows to identify between different number and its respective expression. This system description paper uses T5 and BART as the baseline model, comparing its result with and without the constrative loss. The result shows that BART with contrastive loss have excelled all the models, and its performance on the number accuracy has the highest performance among all.

pdf bib abs
Team Bolaca at SemEval-2024 Task 6: Sentence-transformers are all you need
Béla Rösener | Hong-bo Wei | Ilinca Vandici

Our team tackled the SemEval-2024 Task 6, focusing on identifying fluent over-generation hallucinations in NLP outputs. We proposed a pragmatic solution using a logistic regression classifier and a feed-forward ANN, harnessing SBERT embeddings for feature extraction.

pdf bib abs
AIpom at SemEval-2024 Task 8: Detecting AI-produced Outputs in M4
Alexander Shirnin | Nikita Andreev | Vladislav Mikhailov | Ekaterina Artemova

This paper describes AIpom, a system designed to detect a boundary between human-written and machine-generated text (SemEval-2024 Task 8, Subtask C: Human-Machine Mixed Text Detection). We propose a two-stage pipeline combining predictions from an instruction-tuned decoder-only model and encoder-only sequence taggers. AIpom is ranked second on the leaderboard while achieving a Mean Absolute Error of 15.94. Ablation studies confirm the benefits of pipelining encoder and decoder models, particularly in terms of improved performance.

pdf bib abs
CLaC at SemEval-2024 Task 2: Faithful Clinical Trial Inference
Jennifer Marks | Mohammadreza Davari | Leila Kosseim

This paper presents the methodology used for our participation in SemEval 2024 Task 2 (Jullien et al., 2024) – Safe Biomedical Natural Language Inference for Clinical Trials. The task involved Natural Language Inference (NLI) on clinical trial data, where statements were provided regarding information within Clinical Trial Reports (CTRs). These statements could pertain to a single CTR or compare two CTRs, requiring the identification of the inference relation (entailment vs contradiction) between CTR-statement pairs. Evaluation was based on F1, Faithfulness, and Consistency metrics, with priority given to the latter two by the organizers. Our approach aims to maximize Faithfulness and Consistency, guided by intuitive definitions provided by the organizers, without detailed metric calculations. Experimentally, our approach yielded models achieving maximal Faithfulness (top rank) and average Consistency (mid rank) at the expense of F1 (low rank). Future work will focus on refining our approach to achieve a balance among all three metrics.

pdf bib abs
MALTO at SemEval-2024 Task 6: Leveraging Synthetic Data for LLM Hallucination Detection
Federico Borra | Claudio Savelli | Giacomo Rosso | Alkis Koudounas | Flavio Giobergia

In Natural Language Generation (NLG), contemporary Large Language Models (LLMs) face several challenges, such as generating fluent yet inaccurate outputs and reliance on fluency-centric metrics. This often leads to neural networks exhibiting “hallucinations.” The SHROOM challenge focuses on automatically identifying these hallucinations in the generated text. To tackle these issues, we introduce two key components, a data augmentation pipeline incorporating LLM-assisted pseudo-labelling and sentence rephrasing, and a voting ensemble from three models pre-trained on Natural Language Inference (NLI) tasks and fine-tuned on diverse datasets.

pdf bib abs
Maha Bhaashya at SemEval-2024 Task 6: Zero-Shot Multi-task Hallucination Detection
Patanjali Bhamidipati | Advaith Malladi | Manish Shrivastava | Radhika Mamidi

In recent studies, the extensive utilization oflarge language models has underscored the importance of robust evaluation methodologiesfor assessing text generation quality and relevance to specific tasks. This has revealeda prevalent issue known as hallucination, anemergent condition in the model where generated text lacks faithfulness to the source anddeviates from the evaluation criteria. In thisstudy, we formally define hallucination and propose a framework for its quantitative detectionin a zero-shot setting, leveraging our definitionand the assumption that model outputs entailtask and sample specific inputs. In detectinghallucinations, our solution achieves an accuracy of 0.78 in a model-aware setting and 0.61in a model-agnostic setting. Notably, our solution maintains computational efficiency, requiring far less computational resources than other SOTA approaches, aligning with the trendtowards lightweight and compressed models.

This paper presents our solution for subtask A of shared task 8 of SemEval 2024 for classifying human- and machine-written texts in English across multiple domains. We propose a fusion model consisting of RoBERTa based pre-classifier and two MLPs that have been trained to correct the pre-classifier using linguistic features. Our model achieved an accuracy of 85%.

The SemEval-2024 Task 3 presents two subtasks focusing on emotion-cause pair extraction within conversational contexts. Subtask 1 revolves around the extraction of textual emotion-cause pairs, where causes are defined and annotated as textual spans within the conversation. Conversely, Subtask 2 extends the analysis to encompass multimodal cues, including language, audio, and vision, acknowledging instances where causes may not be exclusively represented in the textual data. Our proposed model for emotion-cause analysis is meticulously structured into three core segments: (i) embedding extraction, (ii) cause-pair extraction & emotion classification, and (iii) cause extraction using QA after finding pairs. Leveraging state-of-the-art techniques and fine-tuning on task-specific datasets, our model effectively unravels the intricate web of conversational dynamics and extracts subtle cues signifying causality in emotional expressions. Our team, AIMA, demonstrated strong performance in the SemEval-2024 Task 3 competition. We ranked as the 10th in subtask 1 and the 6th in subtask 2 out of 23 teams.

In this study, we introduce a solution to the SemEval 2024 Task 10 on subtask 1, dedicated to Emotion Recognition in Conversation (ERC) in code-mixed Hindi-English conversations. ERC in code-mixed conversations presents unique challenges, as existing models are typically trained on monolingual datasets and may not perform well on code-mixed data. To address this, we propose a series of models that incorporate both the previous and future context of the current utterance, as well as the sequential information of the conversation. To facilitate the processing of code-mixed data, we developed a Hinglish-to-English translation pipeline to translate the code-mixed conversations into English. We designed four different base models, each utilizing powerful pre-trained encoders to extract features from the input but with varying architectures. By ensembling all of these models, we developed a final model that outperforms all other baselines.

pdf bib abs
Team MGTD4ADL at SemEval-2024 Task 8: Leveraging (Sentence) Transformer Models with Contrastive Learning for Identifying Machine-Generated Text
Huixin Chen | Jan Büssing | David Rügamer | Ercong Nie

This paper outlines our approach to SemEval-2024 Task 8 (Subtask B), which focuses on discerning machine-generated text from human-written content, while also identifying the text sources, i.e., from which Large Language Model (LLM) the target text is generated. Our detection system is built upon Transformer-based techniques, leveraging various pre-trained language models (PLMs), including sentence transformer models. Additionally, we incorporate Contrastive Learning (CL) into the classifier to improve the detecting capabilities and employ Data Augmentation methods. Ultimately, our system achieves a peak accuracy of 76.96% on the test set of the competition, configured using a sentence transformer model integrated with CL methodology.

pdf bib abs
ClusterCore at SemEval-2024 Task 7: Few Shot Prompting With Large Language Models for Numeral-Aware Headline Generation
Monika Singh | Sujit Kumar | Tanveen . | Sanasam Ranbir Singh

The generation of headlines, a crucial aspect of abstractive summarization, aims to compress an entire article into a concise, single line of text despite the effectiveness of modern encoder-decoder models for text generation and summarization tasks. The encoder-decoder model commonly faces challenges in accurately generating numerical content within headlines. This study empirically explored LLMs for numeral-aware headline generation and proposed few-shot prompting with LLMs for numeral-aware headline generations. Experiments conducted on the NumHG dataset and NumEval-2024 test set suggest that fine-tuning LLMs on NumHG dataset enhances the performance of LLMs for numeral aware headline generation. Furthermore, few-shot prompting with LLMs surpassed the performance of fine-tuned LLMs for numeral-aware headline generation.

pdf bib abs
HierarchyEverywhere at SemEval-2024 Task 4: Detection of Persuasion Techniques in Memes Using Hierarchical Text Classifier
Omid Ghahroodi | Ehsaneddin Asgari

Text classification is an important task in natural language processing. Hierarchical Text Classification (HTC) is a subset of text classification task-type. HTC tackles multi-label classification challenges by leveraging tree structures that delineate relationships between classes, thereby striving to enhance classification accuracy through the utilization of inter-class relationships. Memes, as prevalent vehicles of modern communication within social networks, hold immense potential as instruments for propagandistic dissemination due to their profound impact on users. In SemEval-2024 Task 4, the identification of propaganda and its various forms in memes is explored through two sub-tasks: (i) utilizing only the textual component of memes, and (ii) incorporating both textual and pictorial elements. In this study, we address the proposed problem through the lens of HTC, using state-of-the-art hierarchical text classification methodologies to detect propaganda in memes. Our system achieved first place in English Sub-task 2a, underscoring its efficacy in tackling the complexities inherent in propaganda detection within the meme landscape.

pdf bib abs
AILS-NTUA at SemEval-2024 Task 9: Cracking Brain Teasers: Transformer Models for Lateral Thinking Puzzles
Ioannis Panagiotopoulos | George Filandrianos | Maria Lymperaiou | Giorgos Stamou

In this paper, we outline our submission for the SemEval-2024 Task 9 competition: ‘BRAINTEASER: A Novel Task Defying Common Sense’. We engage in both sub-tasks: Sub-task A-Sentence Puzzle and Sub-task B-Word Puzzle. We evaluate a plethora of pre-trained transformer-based language models of different sizes through fine-tuning. Subsequently, we undertake an analysis of their scores and responses to aid future researchers in understanding and utilizing these models effectively. Our top-performing approaches secured competitive positions on the competition leaderboard across both sub-tasks. In the evaluation phase, our best submission attained an average accuracy score of 81.7% in the Sentence Puzzle, and 85.4% in the Word Puzzle, significantly outperforming the best neural baseline (ChatGPT) by more than 20% and 30% respectively.

pdf bib abs
DeepPavlov at SemEval-2024 Task 3: Multimodal Large Language Models in Emotion Reasoning
Julia Belikova | Dmitrii Kosenko

This paper presents the solution of the DeepPavlov team for the Multimodal Sentiment Cause Analysis competition in SemEval-2024 Task 3, Subtask 2 (Wang et al., 2024). In the evaluation leaderboard, our approach ranks 7th with an F1-score of 0.2132. Large Language Models (LLMs) are transformative in their ability to comprehend and generate human-like text. With recent advancements, Multimodal Large Language Models (MLLMs) have expanded LLM capabilities, integrating different modalities such as audio, vision, and language. Our work delves into the state-of-the-art MLLM Video-LLaMA, its associated modalities, and its application to the emotion reasoning downstream task, Multimodal Emotion Cause Analysis in Conversations (MECAC). We investigate the model’s performance in several modes: zero-shot, few-shot, individual embeddings, and fine-tuned, providing insights into their limits and potential enhancements for emotion understanding.

pdf bib abs
iREL at SemEval-2024 Task 9: Improving Conventional Prompting Methods for Brain Teasers
Harshit Gupta | Manav Chaudhary | Shivansh Subramanian | Tathagata Raha | Vasudeva Varma

This paper describes our approach for SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense. The BRAINTEASER task comprises multiple-choice Question Answering designed to evaluate the models’ lateral thinking capabilities. It consists of Sentence Puzzle and Word Puzzle subtasks that require models to defy default commonsense associations and exhibit unconventional thinking. We propose a unique strategy to improve the performance of pre-trained language models, notably the Gemini 1.0 Pro Model, in both subtasks. We employ static and dynamic few-shot prompting techniques and introduce a model-generated reasoning strategy that utilizes the LLM’s reasoning capabilities to improve performance. Our approach demonstrated significant improvements, showing that it performed better than the baseline models by a considerable margin but fell short of performing as well as the human annotators, thus highlighting the efficacy of the proposed strategies.

pdf bib abs
uTeBC-NLP at SemEval-2024 Task 9: Can LLMs be Lateral Thinkers?
Pouya Sadeghi | Amirhossein Abaskohi | Yadollah Yaghoobzadeh

Inspired by human cognition, Jiang et al. 2023 create a benchmark for assessing LLMs’ lateral thinking—thinking outside the box. Building upon this benchmark, we investigate how different prompting methods enhance LLMs’ performance on this task to reveal their inherent power for outside-the-box thinking ability. Through participating in SemEval-2024, task 9, Sentence Puzzle sub-task, we explore prompt engineering methods: chain of thoughts (CoT) and direct prompting, enhancing with informative descriptions, and employing contextualizing prompts using a retrieval augmented generation (RAG) pipeline. Our experiments involve three LLMs including GPT-3.5, GPT-4, and Zephyr-7B-beta. We generate a dataset of thinking paths between riddles and options using GPT-4, validated by humans for quality. Findings indicate that compressed informative prompts enhance performance. Dynamic in-context learning enhances model performance significantly. Furthermore, fine-tuning Zephyr on our dataset enhances performance across other commonsense datasets, underscoring the value of innovative thinking.

pdf bib abs
IITK at SemEval-2024 Task 4: Hierarchical Embeddings for Detection of Persuasion Techniques in Memes
Shreenaga Chikoti | Shrey Mehta | Ashutosh Modi

Memes are one of the most popular types of content used in an online disinformation campaign. They are primarily effective on social media platforms since they can easily reach many users. Memes in a disinformation campaign achieve their goal of influencing the users through several rhetorical and psychological techniques, such as causal oversimplification, name-calling, and smear. The SemEval 2024 Task 4 Multilingual Detection of Persuasion Technique in Memes on identifying such techniques in the memes is divided across three sub-tasks: (1) Hierarchical multi-label classification using only textual content of the meme, (2) Hierarchical multi-label classification using both, textual and visual content of the meme and (3) Binary classification of whether the meme contains a persuasion technique or not using it’s textual and visual content. This paper proposes an ensemble of Class Definition Prediction (CDP) and hyperbolic embeddings-based approaches for this task. We enhance meme classification accuracy and comprehensiveness by integrating HypEmo’s hierarchical label embeddings (Chen et al., 2023) and a multi-task learning framework for emotion prediction. We achieve a hierarchical F1-score of 0.60, 0.67, and 0.48 on the respective sub-tasks.

pdf bib abs
HIT-MI&T Lab at SemEval-2024 Task 6: DeBERTa-based Entailment Model is a Reliable Hallucination Detector
Wei Liu | Wanyao Shi | Zijian Zhang | Hui Huang

This paper describes our submission for SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes. We propose four groups of methods for hallucination detection: 1) Entailment Recognition; 2) Similarity Search; 3) Factuality Verification; 4) Confidence Estimation. The four methods rely on either the semantic relationship between the hypothesis and its source (target) or on the model-aware features during decoding. We participated in both the model-agnostic and model-aware tracks. Our method’s effectiveness is validated by our high rankings 3rd in the model-agnostic track and 5th in the model-aware track. We have released our code on GitHub.

We describe our systems for SemEval-2024 Task 1: Semantic Textual Relatedness. We investigate the correlation between semantic relatedness and semantic similarity. Specifically, we test two hypotheses: (1) similarity is a special case of relatedness, and (2) semantic relatedness is preserved under translation. We experiment with a variety of approaches which are based on explicit semantics, downstream applications, contextual embeddings, large language models (LLMs), as well as ensembles of methods. We find empirical support for our theoretical insights. In addition, our best ensemble system yields highly competitive results in a number of diverse categories. Our code and data are available on GitHub.

In this article, we present an effective system for semeval-2024 task 5. The task involves assessing the feasibility of a given solution in civil litigation cases based on relevant legal provisions, issues, solutions, and analysis. This task demands a high level of proficiency in U.S. law and natural language reasoning. In this task, we designed a self-eval LLM system that simultaneously performs reasoning and self-assessment tasks. We created a confidence interval and a prompt instructing the LLM to output the answer to a question along with its confidence level. We designed a series of experiments to prove the effectiveness of the self-eval mechanism. In order to avoid the randomness of the results, the final result is obtained by voting on three results generated by the GPT-4. Our submission was conducted under zero-resource setting, and we achieved first place in the task with an F1-score of 0.8231 and an accuracy of 0.8673.

pdf bib abs
IITK at SemEval-2024 Task 10: Who is the speaker? Improving Emotion Recognition and Flip Reasoning in Conversations via Speaker Embeddings
Shubham Patel | Divyaksh Shukla | Ashutosh Modi

This paper presents our approach for the SemEval-2024 Task 10: Emotion Discovery and Reasoning its Flip in Conversations. We propose a transformer-based speaker-centric model for the Emotion Flip Reasoning (EFR) task and a masked-memory network along with a speaker participation vector for the Emotion Recognition in Conversations (ERC) task. We propose a Probable Trigger Zone, which is more likely to contain the utterances causing the emotion of a speaker to flip. In EFR, sub-task 3, the proposed approach archives a 5.9 (F1 score) improvement over the provided task baseline. The ablation study results highlight the significance of various design choices in the proposed method.

pdf bib abs
DeepPavlov at SemEval-2024 Task 8: Leveraging Transfer Learning for Detecting Boundaries of Machine-Generated Texts
Anastasia Voznyuk | Vasily Konovalov

The Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection shared task in the SemEval-2024 competition aims to tackle the problem of misusing collaborative human-AI writing. Although there are a lot of existing detectors of AI content, they are often designed to give a binary answer and thus may not be suitable for more nuanced problem of finding the boundaries between human-written and machine-generated texts, while hybrid human-AI writing becomes more and more popular. In this paper, we address the boundary detection problem. Particularly, we present a pipeline for augmenting data for supervised fine-tuning of DeBERTaV3. We receive new best MAE score, according to the leaderboard of the competition, with this pipeline.

pdf bib abs
Bit_numeval at SemEval-2024 Task 7: Enhance Numerical Sensitivity and Reasoning Completeness for Quantitative Understanding
Xinyue Liang | Jiawei Li | Yizhe Yang | Yang Gao

In this paper, we describe the methods used for Quantitative Natural Language Inference (QNLI), and Quantitative Question Answering (QQA) in task1 of Semeval2024 NumEval. The challenge’s focus is to enhance the model’s quantitative understanding consequently improving its performance on certain tasks. We accomplish this task from two perspectives: (1) By integrating real-world numerical comparison data during the supervised fine-tuning (SFT) phase, we enhanced the model’s numerical sensitivity. (2) We develop an innovative reward model scoring mechanism, leveraging reinforcement learning from human feedback (RLHF) techniques to improve the model’s reasoning completeness.

pdf bib abs
MaiNLP at SemEval-2024 Task 1: Analyzing Source Language Selection in Cross-Lingual Textual Relatedness
Shijia Zhou | Huangyan Shan | Barbara Plank | Robert Litschko

This paper presents our system developed for the SemEval-2024 Task 1: Semantic Textual Relatedness (STR), on Track C: Cross-lingual. The task aims to detect semantic relatedness of two sentences from the same languages. For cross-lingual approach we developed a set of linguistics-inspired models trained with several task-specific strategies. We 1) utilize language vectors for selection of donor languages; 2) investigate the multi-source approach for training; 3) use transliteration of non-latin script to study impact of “script gap”; 4) opt machine translation for data augmentation. We additionally compare the performance of XLM-RoBERTa and Furina with the same training strategy. Our submission achieved the first place in the C8 (Kinyarwanda) test.

pdf bib abs
NLP_Team1@SSN at SemEval-2024 Task 1: Impact of language models in Sentence-BERT for Semantic Textual Relatedness in Low-resource Languages
Senthil Kumar | Aravindan Chandrabose | Gokulakrishnan B | Karthikraja Tp

Semantic Textual Relatedness (STR) will provide insight into the limitations of existing models and support ongoing work on semantic representations. Track A in Shared Task-1, provides pairs of sentences with semantic relatedness scores for 9 languages out of which 7 are low-resources. These languages are from four different language families. We developed models for 8 languages (except for Amharic) in Track A, using Sentence Transformers (SBERT) architecture, and fine-tuned them with multilingual and monolingual pre-trained language models (PLM). Our models for English (eng), Algerian Arabic (arq), andKinyarwanda (kin) languages were ranked 12, 5, and 8 respectively. Our submissions are ranked 5th among 40 submissions in Track A with an average Spearman correlation score of 0.74. However, we observed that the usage of monolingual PLMs did not guarantee better than multilingual PLMs in Marathi (mar), and Telugu (tel) languages in our case.

pdf bib abs
ShefCDTeam at SemEval-2024 Task 4: A Text-to-Text Model for Multi-Label Classification
Meredith Gibbons | Maggie Mi | Xingyi Song | Aline Villavicencio

This paper presents our findings for SemEval2024 Task 4. We submit only to subtask 1, applying the text-to-text framework using a FLAN-T5 model with a combination of parameter efficient fine-tuning methods - low-rankadaptation and prompt tuning. Overall, we find that the system performs well in English, but performance is limited in Bulgarian, North Macedonian and Arabic. Our analysis raises interesting questions about the effects of labelorder and label names when applying the text-to-text framework.

pdf bib abs
NLPNCHU at SemEval-2024 Task 4: A Comparison of MDHC Strategy and In-domain Pre-training for Multilingual Detection of Persuasion Techniques in Memes
Shih-wei Guo | Yao-chung Fan

This study presents a systematic method for identifying 22 persuasive techniques used in multilingual memes. We explored various fine-tuning techniques and classification strategies, such as data augmentation, problem transformation, and hierarchical multi-label classification strategies. Identifying persuasive techniques in memes involves a multimodal task. We fine-tuned the XLM-RoBERTA-large-twitter language model, focusing on domain-specific language modeling, and integrated it with the CLIP visual model’s embedding to consider image and text features simultaneously. In our experiments, we evaluated the effectiveness of our approach by using official validation data in English. Our system in the competition, achieving competitive rankings in Subtask1 and Subtask2b across four languages: English, Bulgarian, North Macedonian, and Arabic. Significantly, we achieved 2nd place ranking for Arabic language in Subtask 1.

pdf bib abs
Mothman at SemEval-2024 Task 9: An Iterative System for Chain-of-Thought Prompt Optimization
Alvin Po-Chun Chen | Ray Groshan | Sean von Bayern

Extensive research exists on the performance of large language models on logic-based tasks, whereas relatively little has been done on their ability to generate creative solutions on lateral thinking tasks. The BrainTeaser shared task tests lateral thinking and uses adversarial datasets to prevent memorization, resulting in poor performance for out-of-the-box models. We propose a system for iterative, chain-of-thought prompt engineering which optimizes prompts using human evaluation. Using this shared task, we demonstrate our system’s ability to significantly improve model performance by optimizing prompts and evaluate the input dataset.

pdf bib abs
Zero Shot is All You Need at SemEval-2024 Task 9: A study of State of the Art LLMs on Lateral Thinking Puzzles
Erfan Moosavi Monazzah | Mahdi Feghhi

The successful deployment of large language models in numerous NLP tasks has spurred the demand for tackling more complex tasks, which were previously unattainable. SemEval-2024 Task 9 introduces the brainteaser dataset that necessitates intricate, human-like reasoning to solve puzzles that challenge common sense. At first glance, the riddles in the dataset may appear trivial for humans to solve. However, these riddles demand lateral thinking, which deviates from vertical thinking that is the dominant form when it comes to current reasoning tasks. In this paper, we examine the ability of current state-of-the-art LLMs to solve this task. Our study is diversified by selecting both open and closed source LLMs with varying numbers of parameters. Additionally, we extend the task dataset with synthetic explanations derived from the LLMs’ reasoning processes during task resolution. These could serve as a valuable resource for further expanding the task dataset and developing more robust methods for tasks that require complex reasoning. All the codes and datasets are available in paper’s GitHub repository.

pdf bib abs
Edinburgh Clinical NLP at SemEval-2024 Task 2: Fine-tune your model unless you have access to GPT-4
Aryo Gema | Giwon Hong | Pasquale Minervini | Luke Daines | Beatrice Alex

The NLI4CT task assesses Natural Language Inference systems in predicting whether hypotheses entail or contradict evidence from Clinical Trial Reports. In this study, we evaluate various Large Language Models (LLMs) with multiple strategies, including Chain-of-Thought, In-Context Learning, and Parameter-Efficient Fine-Tuning (PEFT). We propose a PEFT method to improve the consistency of LLMs by merging adapters that were fine-tuned separately using triplet and language modelling objectives. We found that merging the two PEFT adapters improves the F1 score (+0.0346) and consistency (+0.152) of the LLMs. However, our novel methods did not produce more accurate results than GPT-4 in terms of faithfulness and consistency. Averaging the three metrics, GPT-4 ranks joint-first in the competition with 0.8328. Finally, our contamination analysis with GPT-4 indicates that there was no test data leakage. Our code is available at https://github.com/EdinburghClinicalNLP/semeval_nli4ct.

pdf bib abs
CaresAI at SemEval-2024 Task 2: Improving Natural Language Inference in Clinical Trial Data using Model Ensemble and Data Explanation
Reem Abdel-salam | Mary Adewunmi | Mercy Akinwale

Large language models (LLMs) have demonstrated state-of-the-art performance across multiple domains in various natural language tasks. Entailment tasks, however, are more difficult to achieve with a high-performance model. The task is to use safe natural language models to conclude biomedical clinical trial reports (CTRs). The Natural Language Inference for Clinical Trial Data (NLI4CT) task aims to define a given entailment and hypothesis based on CTRs. This paper aims to address the challenges of medical abbreviations and numerical data that can be logically inferred from one another due to acronyms, using different data pre-processing techniques to explain such data. This paper presents a model for NLI4CT SemEval 2024 task 2 that trains the data with DeBERTa, BioLink, BERT, GPT2, BioGPT, and Clinical BERT using the best training approaches, such as fine-tuning, prompt tuning, and contrastive learning. Furthermore, to validate these models, different experiments have been carried out. Our best system is built on an ensemble of different models with different training settings, which achieves an F1 score of 0.77, a faithfulness score of 0.76, and a consistency score of 0.75 and secures the sixth rank in the official leaderboard. In conclusion, this paper has addressed challenges in medical text analysis by exploring various NLP techniques, evaluating multiple advanced natural languagemodels(NLM) models and achieving good results with the ensemble model. Additionally, this project has contributed to the advancement of safe and effective NLMs for analysing complex medical data in CTRs.

pdf bib abs
CVcoders on Semeval-2024 Task 4
Fatemezahra Bakhshande | Mahdieh Naderi

In this paper, we present our methodology for addressing the SemEval 2024 Task 4 on “Multilingual Detection of Persuasion Techniques in Memes.” Our method focuses on identifying persuasion techniques within textual and multimodal meme content using a combination of preprocessing techniques and established models. By integrating advanced preprocessing methods, such as the OpenAI API for text processing, and utilizing a multimodal architecture combining VGG for image feature extraction and GPT-2 for text feature extraction, we achieve improved model performance. To handle class imbalance, we employ Focal Loss as the loss function and AdamW as the optimizer. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in the task. Notably, our system attains an F1 macro score of 0.67 and an F1 micro score of 0.74 on the test dataset, ranking third among all participants in the competition. Our findings highlight the importance of robust preprocessing techniques and model selection in effectively analyzing memes for persuasion techniques, contributing to efforts to combat misinformation on social media platforms.

pdf bib abs
Groningen Team F at SemEval-2024 Task 8: Detecting Machine-Generated Text using Feature-Based Machine Learning Models
Rina Donker | Björn Overbeek | Dennis Thulden | Oscar Zwagers

Large language models (LLMs) have shown remarkable capability of creating fluent responses to a wide variety of user queries. However, this also comes with concerns regarding the spread of misinformation and potential misuse within educational context. In this paper we describe our contribution to SemEval-2024 Task 8 (Wang et al., 2024), a shared task created around detecting machine-generated text. We aim to create several feature-based models that can detect whether a text is machine-generated or human-written. In the end, we obtained an accuracy of 0.74 on the binary human-written vs. machine-generated text classification task (Subtask A monolingual) and an accuracy of 0.61 on the multi-way machine-generated text-classification task (Subtask B). For future work, more features and models could be implemented.

pdf bib abs
Groningen Team A at SemEval-2024 Task 8: Human/Machine Authorship Attribution Using a Combination of Probabilistic and Linguistic Features
Huseyin Alecakir | Puja Chakraborty | Pontus Henningsson | Matthijs Van Hofslot | Alon Scheuer

Our approach primarily centers on feature-based systems, where a diverse array of features pertinent to the text’s linguistic attributes is extracted. Alongside those, we incorporate token-level probabilistic features which are fed into a Bidirectional Long Short-Term Memory (BiLSTM) model. Both resulting feature arrays are concatenated and fed into our final prediction model. Our method under-performed compared to the baseline, despite the fact that previous attempts by others have successfully used linguistic features for the purpose of discerning machine-generated text. We conclude that our examined subset of linguistically motivated features alongside probabilistic features was not able to contribute almost any performance at all to a hybrid classifier of human and machine texts.

pdf bib abs
SemEval 2024 - Task 10: Emotion Discovery and Reasoning its Flip in Conversation (EDiReF)
Shivani Kumar | Md. Shad Akhtar | Erik Cambria | Tanmoy Chakraborty

We present SemEval-2024 Task 10, a shared task centred on identifying emotions and finding the rationale behind their flips within monolingual English and Hindi-English code-mixed dialogues. This task comprises three distinct subtasks – emotion recognition in conversation for code-mixed dialogues, emotion flip reasoning for code-mixed dialogues, and emotion flip reasoning for English dialogues. Participating systems were tasked to automatically execute one or more of these subtasks. The datasets for these tasks comprise manually annotated conversations focusing on emotions and triggers for emotion shifts.1 A total of 84 participants engaged in this task, with the most adept systems attaining F1-scores of 0.70, 0.79, and 0.76 for the respective subtasks. This paper summarises the results and findings from 24 teams alongside their system descriptions.

pdf bib abs
SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials
Mael Jullien | Marco Valentino | André Freitas

Large Language Models (LLMs) are at the forefront of NLP achievements but fall short in dealing with shortcut learning, factual inconsistency, and vulnerability to adversarial inputs. These shortcomings are especially critical in medical contexts, where they can misrepresent actual model capabilities. Addressing this, we present SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials. Our contributions include the refined NLI4CT-P dataset (i.e. Natural Language Inference for Clinical Trials - Perturbed), designed to challenge LLMs with interventional and causal reasoning tasks, along with a comprehensive evaluation of methods and results for participant submissions. A total of 106 participants registered for the task contributing to over 1200 individual submissions and 25 system overview papers. This initiative aims to advance the robustness and applicability of NLI models in healthcare, ensuring safer and more dependable AI assistance in clinical decision-making. We anticipate that the dataset, models, and outcomes of this task can support future research in the field of biomedical NLI. The dataset, competition leaderboard, and website are publicly available.

We present the first shared task on Semantic Textual Relatedness (STR). While earlier shared tasks primarily focused on semantic similarity, we instead investigate the broader phenomenon of semantic relatedness across 14 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by the relatively limited availability of NLP resources. Each instance in the datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. Participating systems were asked to rank sentence pairs by their closeness in meaning (i.e., their degree of semantic relatedness) in the 14 languages in three main tracks: (a) supervised, (b) unsupervised, and (c) crosslingual. The task attracted 163 participants. We received 70 submissions in total (across all tasks) from 51 different teams, and 38 system description papers. We report on the best-performing systems as well as the most common and the most effective approaches for the three different tracks.

This paper presents the results of the SHROOM, a shared task focused on detecting hallucinations: outputs from natural language generation (NLG) systems that are fluent, yet inaccurate. Such cases of overgeneration put in jeopardy many NLG applications, where correctness is often mission-critical. The shared task was conducted with a newly constructed dataset of 4000 model outputs labeled by 5 annotators each, spanning 3 NLP tasks: machine translation, paraphrase generation and definition modeling.The shared task was tackled by a total of 58 different users grouped in 42 teams, out of which 26 elected to write a system description paper; collectively, they submitted over 300 prediction sets on both tracks of the shared task. We observe a number of key trends in how this approach was tackled—many participants rely on a handful of model, and often rely either on synthetic data for fine-tuning or zero-shot prompting strategies. While a majority of the teams did outperform our proposed baseline system, the performances of top-scoring systems are still consistent with a random handling of the more challenging items.

pdf bib abs
SemEval-2024 Task 9: BRAINTEASER: A Novel Task Defying Common Sense
Yifan Jiang | Filip Ilievski | Kaixin Ma

While vertical thinking relies on logical and commonsense reasoning, lateral thinking requires systems to defy commonsense associations and overwrite them through unconventional thinking. Lateral thinking has been shown to be challenging for current models but has received little attention. A recent benchmark, BRAINTEASER, aims to evaluate current models’ lateral thinking ability in a zero-shot setting. In this paper, we split the original benchmark to also support fine-tuning setting and present SemEval Task 9, BRAINTEASER(S), the first task at this competition designed to test the system’s reasoning and lateral thinking ability. As a popular task, BRAINTEASER(S)’s two subtasks receive 483 team submissions from 182 participants during the competition. This paper provides a fine-grained system analysis of the competition results, together with a reflection on what this means for the ability of the systems to reason laterally.We hope that the BRAINTEASER(S) subtasks and findings in this paper can stimulate future work on lateral thinking and robust reasoning by computational models

The automatic identification of misleading and persuasive content has emerged as a significant issue among various stakeholders, including social media platforms, policymakers, and the broader society. To tackle this issue within the context of memes, we organized a shared task at SemEval-2024, focusing on the multilingual detection of persuasion techniques. This paper outlines the dataset, the organization of the task, the evaluation framework, the outcomes, and the systems that participated. The task targets memes in four languages, with the inclusion of three surprise test datasets in Bulgarian, North Macedonian, and Arabic. It encompasses three subtasks: (i) identifying whether a meme utilizes a persuasion technique; (ii) identifying persuasion techniques within the meme’s ”textual content”; and (iii) identifying persuasion techniques across both the textual and visual components of the meme (a multimodal task). Furthermore, due to the complex nature of persuasion techniques, we present a hierarchy that groups the 22 persuasion techniques into several levels of categories. This became one of the attractive shared tasks in SemEval 2024, with 153 teams registered, 48 teams submitting results, and finally, 32 system description papers submitted.

pdf bib abs
SemEval-2024 Task 5: Argument Reasoning in Civil Procedure
Lena Held | Ivan Habernal

This paper describes the results of SemEval-2024 Task 5: Argument Reasoning in Civil Procedure, consisting of a single task on judging and reasoning about the answers to questions in U.S. civil procedure. The dataset for this task contains question, answer and explanation pairs taken from The Glannon Guide To Civil Procedure (Glannon, 2018). The task was to classify in a binary manner if the answer is a correct choice for the question or not. Twenty participants submitted their solutions, with the best results achieving a remarkable 82.31% F1-score. We summarize and analyze the results from all participating systems and provide an overview over the systems of 14 participants.

pdf bib abs
SemEval-2024 Task 3: Multimodal Emotion Cause Analysis in Conversations
Fanfan Wang | Heqing Ma | Rui Xia | Jianfei Yu | Erik Cambria

The ability to understand emotions is an essential component of human-like artificial intelligence, as emotions greatly influence human cognition, decision making, and social interactions. In addition to emotion recognition in conversations, the task of identifying the potential causes behind an individual’s emotional state in conversations, is of great importance in many application scenarios. We organize SemEval-2024 Task 3, named Multimodal Emotion Cause Analysis in Conversations, which aims at extracting all pairs of emotions and their corresponding causes from conversations. Under different modality settings, it consists of two subtasks: Textual Emotion-Cause Pair Extraction in Conversations (TECPE) and Multimodal Emotion-Cause Pair Extraction in Conversations (MECPE). The shared task has attracted 143 registrations and 216 successful submissions.In this paper, we introduce the task, dataset and evaluation settings, summarize the systems of the top teams, and discuss the findings of the participants.

pdf bib abs
SheffieldVeraAI at SemEval-2024 Task 4: Prompting and fine-tuning a Large Vision-Language Model for Binary Classification of Persuasion Techniques in Memes
Charlie Grimshaw | Kalina Bontcheva | Xingyi Song

This paper describes our approach for SemEval-2024 Task 4: Multilingual Detection of Persuasion Techniques in Memes. Specifically, we concentrate on Subtask 2b, a binary classification challenge that entails categorizing memes as either “propagandistic” or “non-propagandistic”. To address this task, we utilized the large multimodal pretrained model, LLaVa. We explored various prompting strategies and fine-tuning methods, and observed that the model, when not fine-tuned but provided with a few-shot learning examples, achieved the best performance. Additionally, we enhanced the model’s multilingual capabilities by integrating a machine translation model. Our system secured the 2nd place in the Arabic language category.

We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.