Computer Science > Computation and Language

arXiv:2307.02912 (cs)

[Submitted on 6 Jul 2023]

Title: LEA: Improving Sentence Similarity Robustness to Typos Using Lexical Attention Bias

Authors: Mario Almagro, Emilio Almazán, Diego Ortego, David Jiménez

Abstract: Textual noise, such as typos or abbreviations, is a well-known issue that degrades the performance of vanilla Transformers on most downstream tasks. We show that this is also the case for sentence similarity, a fundamental task in multiple domains, e.g. matching, retrieval or paraphrasing. Sentence similarity can be approached using cross-encoders, where the two sentences are concatenated in the input, allowing the model to exploit the inter-relations between them. Previous works addressing the noise issue mainly rely on data augmentation strategies, showing improved robustness when dealing with corrupted samples that are similar to the ones used for training. However, all these methods still suffer from the token distribution shift induced by typos. In this work, we propose to tackle textual noise by equipping cross-encoders with a novel LExical-aware Attention module (LEA) that incorporates lexical similarities between words in both sentences. By using raw text similarities, our approach avoids the tokenization shift problem, obtaining improved robustness. We demonstrate that the attention bias introduced by LEA helps cross-encoders to tackle complex scenarios with textual noise, especially in domains with short-text descriptions and limited context. Experiments using three popular Transformer encoders on five e-commerce datasets for product matching show that LEA consistently boosts performance in the presence of noise, while remaining competitive on the original (clean) splits. We also evaluate our approach on two datasets for textual entailment and paraphrasing, showing that LEA is robust to typos in domains with longer sentences and more natural context. Additionally, we thoroughly analyze several design choices in our approach, providing insights about the impact of the decisions made and fostering future research in cross-encoders dealing with typos.
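
The general idea of a lexical attention bias can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity metric or how the bias enters the model, so the code below assumes a character-level ratio from Python's `difflib` as a stand-in similarity, works on whole words rather than subword tokens, and adds the bias directly to toy attention logits. The names `lexical_similarity`, `lexical_bias`, `biased_attention` and the `scale` parameter are hypothetical.

```python
import math
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], computed on the raw strings.

    Assumption: difflib's ratio stands in for the lexical metric used in the paper.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def lexical_bias(left_words, right_words, scale=1.0):
    """Build a bias matrix over the concatenated sentence pair.

    Only cross-sentence entries are non-zero, so the bias relates words
    of one sentence to words of the other. `scale` is a hypothetical knob.
    """
    words = list(left_words) + list(right_words)
    n_left = len(left_words)
    n = len(words)
    bias = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # True when positions i and j belong to different sentences.
            if (i < n_left) != (j < n_left):
                bias[i][j] = scale * lexical_similarity(words[i], words[j])
    return bias


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(scores, bias):
    """Add the lexical bias to raw attention logits, then normalize each row."""
    return [
        softmax([s + b for s, b in zip(score_row, bias_row)])
        for score_row, bias_row in zip(scores, bias)
    ]


# Toy usage: a product title with a typo versus its clean counterpart.
left = ["wireless", "keybaord"]
right = ["wireless", "keyboard", "black"]
bias = lexical_bias(left, right, scale=4.0)
# Uniform (all-zero) logits isolate the effect of the bias term.
scores = [[0.0] * 5 for _ in range(5)]
attention = biased_attention(scores, bias)
```

Because the similarity is computed on the raw strings, the misspelled "keybaord" still receives a strong bias towards "keyboard" regardless of how a subword tokenizer would split either word, which is the intuition behind avoiding the tokenization shift.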

Comments:	KDD'23 conference (main research track). (*) These authors contributed equally
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2307.02912 [cs.CL]
	(or arXiv:2307.02912v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2307.02912

Submission history

From: David Jimenez-Cabello
[v1] Thu, 6 Jul 2023 10:53:50 UTC (1,082 KB)
