Computer Science > Machine Learning

arXiv:2402.16829 (cs)

[Submitted on 26 Feb 2024]

Title:GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

Abstract:Embedding models are integral to AI applications like semantic search, personalized recommendations, and retrieval augmented generation for LLMs, necessitating high-quality training data. However, the limited scalability of manual data curation prompts the need for automated methods to ensure data integrity. Traditional unsupervised triplet mining automates training data generation, crucial for embedding model training, yet inadvertently injects biases and noise, thereby degrading model performance. Addressing this, we introduce GISTEmbed, a novel strategy that enhances in-batch negative selection during contrastive training through a guide model. This approach departs from reliance on random sampling and equal utility assumption of batch negatives, significantly reducing noise from data quality issues and improving model fine-tuning. Benchmarked against the Massive Text Embedding Benchmark (MTEB), GISTEmbed showcases consistent performance improvements across various model sizes and achieves state-of-the-art results in select categories. This framework enables significant enhancements for smaller models by leveraging the capabilities of powerful yet resource-intensive large models. GISTEmbed can potentially revolutionize the creation of highly efficient, smaller models, democratizing access to advanced AI technologies. Making these technologies more accessible and cost-effective, especially for applications constrained by resources, significantly expands the impact and accessibility of state-of-the-art AI solutions across diverse sectors.

Comments:	GISTEmbed GitHub repository at this https URL
Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2402.16829 [cs.LG]
	(or arXiv:2402.16829v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.16829

Submission history

From: Aivin Solatorio [view email]
[v1] Mon, 26 Feb 2024 18:55:15 UTC (1,518 KB)

Computer Science > Machine Learning

Title:GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning

Submission history

Access Paper:

References & Citations

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators