1 Introduction
The rise of deep neural networks and self-supervised learning in recent years has brought about a paradigm shift in Information Retrieval. From retrieval to ranking, question answering to recommendation, search to conversational agents, models trained on hand-crafted features have given way to complex neural networks with millions of parameters that are capable of learning granular features from raw data.
While this transition has led to large gains in efficacy across tasks, it has often come at the expense of training and inference efficiency. With deep models embedded in ever more applications and devices, and with a drive towards ever higher efficacy, the rise in costs has a tangible, though often under-reported, impact on researchers, practitioners, users, and, more importantly, the environment. It is therefore unsurprising that the difficult balancing act between celebrating effectiveness and seeking efficiency has resurrected old research questions from the field with renewed urgency.
The aim of this Special Section is to engage with researchers in Information Retrieval, Natural Language Processing, and related areas and gather insight into the core challenges in measuring, reporting, and optimizing all facets of efficiency in Neural Information Retrieval (NIR) systems, including time-, space-, resource-, sample-, and energy-efficiency, among other factors. While researchers in the field have assiduously explored the Pareto frontier in quality and efficiency in other contexts for decades, we believe that the neural dimension introduces new hurdles [2, 3, 4].
The breadth of the challenges facing NIR systems is reflected in the submissions received by the editors of this Special Section. The call for papers attracted 13 submissions in total, of which 7 were accepted to appear in the journal. These touch on topics ranging from the ranking of long documents and sparse representation learning to late-interaction models, sample efficiency, and sequential recommendation. In what follows, we briefly describe each article but invite the reader to review the relevant publication for details.
1.1 Ranking Long Queries and Documents
Let us start with the problem of ranking long sequences of text. Many neural ranking models rely on computationally intensive Transformer [14] blocks. These blocks typically impose a hard limit on the input sequence length because their complexity grows quadratically with the length of the sequence. When applying these models to longer sequences, therefore, we must either truncate the sequence or modify the Transformer block to lower its time complexity.
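To make the constraint concrete, the following minimal sketch (our illustration, not code from any of the articles) shows why naive self-attention imposes such a limit: scoring every token against every other token materializes an L × L matrix, so doubling the input length quadruples the cost. The single-head, unprojected attention and the 512-token cap are simplifying assumptions.

```python
import numpy as np

def naive_self_attention(x: np.ndarray) -> np.ndarray:
    """x: (L, d) token embeddings; returns contextualized embeddings (L, d)."""
    scores = x @ x.T / np.sqrt(x.shape[1])  # (L, L): the quadratic bottleneck
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ x

L, d, max_len = 4096, 64, 512
doc = np.random.randn(L, d)
out = naive_self_attention(doc[:max_len])  # the common workaround: truncate
```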
The first article that approaches this problem is entitled “Revisiting Bag of Words Document Representations for Efficient Ranking with Transformers.” It asks whether there is a third way of handling long sequences if we open up the model itself. In particular, rather than truncating a sequence arbitrarily, the authors propose to represent a document with its “salient” terms only; in effect, a document is condensed into its “characteristic” terms. The main question then becomes: how do we define a salient term, and what is the impact of different definitions of salience on ranking quality? That is the question this article explores in great depth.
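As a concrete illustration, here is a minimal sketch of the bag-of-salient-words idea, using TF-IDF weight as a stand-in for salience; the article studies several definitions, and this particular choice, along with the names and the cutoff `k`, is our assumption.

```python
import math
from collections import Counter

def salient_terms(doc_tokens, doc_freq, num_docs, k=32):
    """Condense a document into its k most 'characteristic' terms."""
    tf = Counter(doc_tokens)
    idf = {t: math.log(num_docs / (1 + doc_freq.get(t, 0))) for t in tf}
    ranked = sorted(tf, key=lambda t: tf[t] * idf[t], reverse=True)
    return ranked[:k]

# The condensed bag of words, rather than a truncated prefix, is then
# passed to the Transformer-based ranker.
```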
The second article in the same category, “Retrieval for Extremely Long Queries and Documents with RPRS,” approaches the same problem but considers a setup where queries, too, can be long sequences. This use case arises in many real-world applications, such as patent search or search over legal documents. The article explores a solution in which long text queries and documents are split into chunks along sentence boundaries. Given a set of candidate documents returned by a neural ranker, the article investigates methods of re-ranking the candidate set according to different (unsupervised) similarity metrics.
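The general recipe can be summarized in a short skeleton, sketched below under our own simplifying assumptions (a regex sentence splitter and a max-then-sum aggregation); it is not the exact RPRS scoring function, for which we refer the reader to the article.

```python
import re

def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def rerank(query, candidates, similarity):
    """Re-rank candidate documents for a long query, chunk by chunk.

    `similarity` stands in for any unsupervised metric over two sentences.
    """
    q_chunks = split_sentences(query)
    def score(doc):
        d_chunks = split_sentences(doc)
        # For each query sentence, keep its best-matching document sentence.
        return sum(max(similarity(q, d) for d in d_chunks) for q in q_chunks)
    return sorted(candidates, key=score, reverse=True)
```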
1.2 Retrieval with Sparse Representations
One of the more interesting neural retrieval paradigms that has emerged in the past few years attempts to learn sparse representations of text documents and use existing inverted index-based technologies [13] to perform efficient retrieval over the sparse representations [1, 5, 7, 8, 9, 11, 12, 15, 16]. The output space of such models has as many dimensions as there are terms in the vocabulary; typically, this is the BERT [6] vocabulary. This one-to-one mapping between output dimensions and vocabulary terms makes such representations highly interpretable, which in turn makes them attractive in many applications.
The main challenge in training such models is to achieve high effectiveness while maintaining enough sparsity in the learned representations that retrieval remains efficient; this is because inverted index-based algorithms operate under the assumption that queries consist of only a few terms and that term frequencies within documents follow a Zipfian distribution. The article entitled “Towards Effective and Efficient Sparse Neural Information Retrieval” details a long thread of research that studies that very question and explores the tradeoffs between the effectiveness and efficiency of learned sparse representations in the context of text retrieval.
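To make the connection to inverted indexes concrete, the following sketch (ours; a real system would use an engine such as those cited above) retrieves over learned sparse vectors represented as term-to-weight dictionaries. Sparsity matters because every nonzero entry lengthens a posting list that must be traversed at query time.

```python
from collections import defaultdict

index = defaultdict(list)               # term -> [(doc_id, weight), ...]

def add_document(doc_id, sparse_vec):
    for term, w in sparse_vec.items():
        if w > 0:                       # sparsity keeps posting lists short
            index[term].append((doc_id, w))

def search(query_vec, k=10):
    scores = defaultdict(float)
    for term, qw in query_vec.items():  # few query terms -> few lists scanned
        for doc_id, dw in index.get(term, []):
            scores[doc_id] += qw * dw   # accumulate the dot product
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]
```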
1.3 Retrieval with Forward Indexes
The article entitled “Efficient Neural Ranking Using Forward Indexes and Lightweight Encoders” introduces a novel index structure, called Fast-Forward indexes, that takes advantage of the ability of dual encoders to pre-compute document representations to significantly improve the efficiency of the re-ranking phase. The authors show that a simple interpolation-based re-ranking combines the benefits of lexical similarity (computed using sparse retrieval) and semantic similarity (computed using dual encoders), which can result in competitive and sometimes better performance than cross-attention models. They thus exploit Fast-Forward indexes to efficiently handle document representations generated with dual encoders within the re-ranking phase. Experiments on public datasets show that dual encoders combined with Fast-Forward indexes provide lower per-query latency and achieve competitive results without needing hardware acceleration such as GPUs.
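The interpolation itself is simple; below is a hedged sketch in which the forward index is modeled as a plain dictionary from document identifier to pre-computed dual-encoder vector, and `alpha` is an assumed interpolation weight rather than a value from the article.

```python
import numpy as np

def rerank(candidates, forward_index, query_vec, alpha=0.5):
    """candidates: [(doc_id, lexical_score), ...] from sparse retrieval."""
    def combined(doc_id, lexical_score):
        # Semantic score: dot product with the pre-computed document vector.
        semantic_score = float(query_vec @ forward_index[doc_id])
        return alpha * lexical_score + (1.0 - alpha) * semantic_score
    return sorted(((d, combined(d, s)) for d, s in candidates),
                  key=lambda x: x[1], reverse=True)
```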
1.4 Late Interaction
Late-interaction multi-vector models, such as ColBERT [10] and COIL [9], achieve state-of-the-art retrieval effectiveness by using all token embeddings to represent documents and queries while modeling their relevance with a sum-of-max operation. The limitation of these fine-grained representations is the space overhead resulting from having to store all token embeddings.
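The sum-of-max operator itself is compact; below is a minimal sketch (ours), assuming L2-normalized token embeddings so that dot products serve as similarities.

```python
import numpy as np

def sum_of_max(query_embs: np.ndarray, doc_embs: np.ndarray) -> float:
    """query_embs: (Lq, d); doc_embs: (Ld, d); rows L2-normalized."""
    sim = query_embs @ doc_embs.T        # (Lq, Ld) token-level similarities
    return float(sim.max(axis=1).sum())  # best document token per query token
```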
In an attempt to lower the storage costs, “An Analysis on Matching Mechanisms and Token Pruning for Late-interaction Models” investigates the matching mechanism of these late-interaction models. It shows that the sum-of-max operation relies heavily on co-occurrence signals and on certain important words in the document. Based on these findings, the authors propose several simple document pruning methods to reduce the storage overhead and compare the effectiveness of different pruning methods on different late-interaction models. The investigation also covers query pruning methods to further reduce retrieval latency.
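For illustration only, one simple pruning heuristic, sketched below and not necessarily among the article’s proposed methods, scores each document token by its best similarity to a sample of query tokens and stores only the top fraction, since tokens that rarely win the max contribute little to the sum-of-max score.

```python
import numpy as np

def prune_doc_tokens(doc_embs, sampled_query_embs, keep_ratio=0.5):
    """Keep the document token embeddings most likely to win the max."""
    sim = sampled_query_embs @ doc_embs.T   # (Lq, Ld)
    importance = sim.max(axis=0)            # best match per document token
    k = max(1, int(keep_ratio * doc_embs.shape[0]))
    return doc_embs[np.argsort(importance)[-k:]]
```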
1.5 Sample Efficiency
The article entitled “Data Augmentation for Sample Efficient and Robust Document Ranking” investigates a rather different aspect of efficiency: one that focuses on training samples. Training a ranking model when too few training examples are available is challenging. The hypothesis this work sets out to explore is whether data augmentation techniques, combined with contrastive learning, can remedy some of those challenges and lead to improved ranking quality. The authors present a comprehensive analysis of various data augmentation methods and contrastive losses in the context of different model sizes. Their experimental results are encouraging: ranking quality improves in both in-domain and out-of-domain settings, with even larger language models benefiting from this scheme. Their findings bode well for sample efficiency: with appropriate data augmentation and a contrastive learning formulation, fewer training examples are needed to train high-quality ranking models.
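The general recipe, augment then contrast, can be sketched as follows; `word_dropout` is a hypothetical stand-in for any of the augmentation methods analyzed, and the InfoNCE-style loss is one common contrastive formulation rather than the article’s exact objective.

```python
import numpy as np

def word_dropout(tokens, p=0.1, rng=None):
    """A hypothetical augmentation: randomly drop words from a document."""
    rng = rng or np.random.default_rng(0)
    kept = [t for t in tokens if rng.random() > p]
    return kept or tokens

def info_nce_loss(query_emb, pos_doc_emb, neg_doc_embs, temperature=0.05):
    """Pull the (augmented) relevant document toward the query; push
    sampled negatives away. Embeddings assumed L2-normalized."""
    pos = query_emb @ pos_doc_emb / temperature
    negs = neg_doc_embs @ query_emb / temperature    # (n,)
    logits = np.concatenate([[pos], negs])
    m = logits.max()                                  # stable log-sum-exp
    return -logits[0] + m + np.log(np.exp(logits - m).sum())
```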
1.6 Sequential Recommendation
Finally, “Teach and Explore: A Multiplex Information-guided Effective and Efficient Reinforcement Learning for Sequential Recommendation” explores the application of Reinforcement Learning (RL) within a Sequential Recommendation (SR) system. It argues that current approaches in this direction are sub-optimal because (1) they fail to leverage supervision signals to capture users’ explicit preferences, and (2) they do not utilize auxiliary information (e.g., knowledge graphs) to avoid blind exploration of users’ potential interests.
To overcome these two limitations, the authors propose a multiplex information-guided RL model (MELOD), which uses a novel RL training framework with Teach and Explore components for SR. MELOD casts the SR task as a sequential decision problem and consists of three novel extensions, state encoding, policy function, and RL training, that work together to learn a comprehensive user representation. Experiments on seven real-world datasets show that MELOD achieves significant performance improvements in terms of Hit Ratio and Normalized Discounted Cumulative Gain over 13 state-of-the-art competitors.