LIPT: Improving Prompt Tuning with Late Inception Reparameterization
Figure 1. Overview of the LIPT framework. Yellow denotes trainable (tunable) modules; blue denotes frozen (non-trainable) modules. Left: the prompt generator with three bottleneck branches. The initialized prompt passes through three bottleneck branches of different sizes, and the branch outputs are added back to the initial prompt (a connection pattern similar to the Inception architecture). Right: the architecture of the transformer-based backbone model, with its forward-propagation and backpropagation paths. During forward propagation, the prompt generator on the left is attached in parallel to the designated "Prompt Layer" of the model on the right; the generated prompt is concatenated with the output of the Prompt Layer and passed to the next layer. During backpropagation, only the prompt generator network is fine-tuned.
Figure 2. Evaluation on one single-sentence task and one sentence-pair task. (Left): comparison of five prompt initialization methods. (Right): effect of adding a self-connection.
Figure 3. LIPT performance on different tasks. (Left): comparison of the number of bottleneck branches. (Right): comparison of bottleneck sizes.
Figure 4. Performance trends on two single-sentence tasks and two sentence-pair tasks at different insertion layers. The RoBERTa-large backbone is used, with even-numbered layers between the 10th and 24th selected. The shaded area shows the mean and standard deviation over 3 random runs.
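The Figure 1 caption describes the prompt generator as three bottleneck branches whose outputs are added back to the initial prompt before the result is concatenated with the hidden states of the chosen Prompt Layer. The PyTorch sketch below illustrates only that connection pattern; the branch widths follow the appendix setting proj_down_size ∈ {64, 128, 256}, while the activation, initialization scale, and batch expansion are assumptions rather than the authors' released implementation.

```python
import torch
import torch.nn as nn


class InceptionPromptGenerator(nn.Module):
    """Sketch of an Inception-style prompt generator: the initial soft prompt
    passes through several bottleneck branches of different widths, and the
    branch outputs are summed back onto the initial prompt (residual add)."""

    def __init__(self, hidden_size=1024, num_prompt_tokens=10,
                 bottleneck_sizes=(64, 128, 256)):
        super().__init__()
        # Trainable initial prompt embeddings (the backbone model stays frozen).
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, hidden_size) * 0.02)
        # One down-projection / non-linearity / up-projection per branch.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, r),
                nn.ReLU(),
                nn.Linear(r, hidden_size),
            )
            for r in bottleneck_sizes
        ])

    def forward(self, batch_size):
        # Residual connection: initial prompt plus the sum of all branch outputs.
        out = self.prompt + sum(branch(self.prompt) for branch in self.branches)
        # Expand to the batch so it can be concatenated with the Prompt Layer output.
        return out.unsqueeze(0).expand(batch_size, -1, -1)
```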
Abstract
1. Introduction
- Introduction of LIPT, a novel soft prompt generation method based on an Inception structure.
- Comprehensive evaluation of LIPT with RoBERTa-large and GPT-2 backbones, showing improved performance on ten standard text classification tasks in both full-data and few-shot scenarios.
- Introduction of an Efficiency Indicator metric for comprehensive performance evaluation of PEFT methods.
2. Related Work
2.1. Parameter-Efficient Fine-Tuning
2.2. Prompt Tuning
2.3. Reparameterization-Based Methods
3. Method
3.1. Problem Formulation
3.2. Late Inception Prompt Tuning
3.3. Design Choices
3.4. Efficiency Indicator
3.4.1. Metric Selection
3.4.2. Accuracy Normalization
3.4.3. Training Cost Indicator Composition
- Training Speed Normalization: Since higher training speed implies lower cost, we apply inverse normalization to training speed.
- Parameter Size Logarithmic Normalization: To handle the large scale of parameter values, we use logarithmic normalization with base 10 (both normalizations are sketched after this list).
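One plausible realization of these two steps, assumed here because the exact formulas are not reproduced in this outline, is min-max inverse normalization of the training speed s and min-max normalization of the base-10 logarithm of the parameter count p:

```latex
% Hypothetical forms consistent with the bullets above; the paper's exact
% constants and ranges are not reproduced here.
\[
  \tilde{s} = \frac{s_{\max} - s}{s_{\max} - s_{\min}}, \qquad
  \tilde{p} = \frac{\log_{10} p - \log_{10} p_{\min}}
                   {\log_{10} p_{\max} - \log_{10} p_{\min}}
\]
% s: training speed (tokens/ms); p: number of tunable parameters.
```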
3.4.4. Nonlinear Adjustment Using the Sigmoid Function
- k controls the slope of the curve, with larger values yielding sharper transitions around the midpoint.
- The midpoint parameter (denoted x0 in the sketch below) sets the center of the distribution, allowing the cost adjustment to be centered around the median or mean value.
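For reference, the standard logistic function matching this description, with slope k and midpoint written here as x0 (a naming assumption, since the paper's symbol is not shown in this outline), is:

```latex
\[
  \sigma(x) = \frac{1}{1 + e^{-k\,(x - x_{0})}}
\]
```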
3.4.5. Calculation of the Final Efficiency Indicator
3.4.6. Summary
4. Experiments
4.1. Datasets
4.2. Experiment Settings
4.3. Baselines
5. Results and Discussion
5.1. Full-Data Results
5.2. Few-Shot Results
5.3. Training Cost
6. Analysis
6.1. Soft Prompt Initialization
6.2. Exploration of LIPT Structure
6.2.1. Xception-like Approach
6.2.2. Bottleneck Sizes
6.3. Layer Insertion Index
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Implementation Details
| Hyperparameter | RoBERTa (Full-Data) | RoBERTa (Few-Shot) | GPT2 (Full-Data) | GPT2 (Few-Shot) |
|---|---|---|---|---|
| #Layers | 24 | 24 | 36 | 36 |
| Hidden size | 1024 | 1024 | 1280 | 1280 |
| Dropout rate | 0.1 | 0.1 | 0.1 | 0.1 |
| Peak learning rate | 1.00 × 10⁻³ | 1.00 × 10⁻³ | 1.00 × 10⁻³ | 1.00 × 10⁻³ |
| Warmup rate | 0.06 | 0.06 | 0.06 | 0.06 |
| Batch size | {16, 32} | {8, 16, 32} | {8, 16} | {4, 8, 16} |
| Weight decay | 0.1 | 0.1 | 0.1 | 0.1 |
| Training steps | \ | 500 | \ | 500 |
| Training epochs | 10 | \ | 10 | \ |
| num_prompt_tokens | 10 | 10 | 10 | 10 |
| proj_down_size | {64, 128, 256} | {64, 128, 256} | {64, 128, 256} | {64, 128, 256} |
References
- Hittawe, M.M.; Harrou, F.; Togou, M.A.; Sun, Y.; Knio, O. Time-series weather prediction in the Red sea using ensemble transformers. Appl. Soft Comput. 2024, 164, 111926. [Google Scholar] [CrossRef]
- Harrou, F.; Zeroual, A.; Hittawe, M.M.; Sun, Y. Chapter 6—Recurrent and convolutional neural networks for traffic management. In Road Traffic Modeling and Management; Harrou, F., Zeroual, A., Hittawe, M.M., Sun, Y., Eds.; Elsevier: Amsterdam, The Netherlands, 2022; pp. 197–246. [Google Scholar]
- Floridi, L.; Chiriatti, M. GPT-3: Its nature, scope, limits, and consequences. Minds Mach. 2020, 30, 681–694. [Google Scholar] [CrossRef]
- Fedus, W.; Zoph, B.; Shazeer, N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res. 2022, 23, 1–39. [Google Scholar]
- Lialin, V.; Deshpande, V.; Rumshisky, A. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv 2023, arXiv:2303.15647. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Fu, Z.; Yang, H.; So, A.M.-C.; Lam, W.; Bing, L.; Collier, N. On the effectiveness of parameter-efficient fine-tuning. Proc. AAAI Conf. Artif. Intell. 2023, 37, 12799–12807. [Google Scholar] [CrossRef]
- Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.-M.; Chen, W. Delta tuning: A comprehensive study of parameter efficient methods for pre-trained language models. arXiv 2022, arXiv:2203.06904. [Google Scholar]
- Lester, B.; Al-Rfou, R.; Constant, N. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 3045–3059. [Google Scholar]
- Li, X.L.; Liang, P. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 4582–4597. [Google Scholar]
- Houlsby, N.; Giurgiu, A.; Jastrzebski, S.; Morrone, B.; De Laroussilhe, Q.; Gesmundo, A.; Attariyan, M.; Gelly, S. Parameter-efficient transfer learning for NLP. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 2790–2799. [Google Scholar]
- Rücklé, A.; Geigle, G.; Glockner, M.; Beck, T.; Pfeiffer, J.; Reimers, N.; Gurevych, I. AdapterDrop: On the Efficiency of Adapters in Transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7930–7946. [Google Scholar]
- Karimi Mahabadi, R.; Henderson, J.; Ruder, S. Compacter: Efficient Low-Rank Hypercomplex Adapter Layers. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; pp. 1022–1035. [Google Scholar]
- He, S.; Ding, L.; Dong, D.; Zhang, J.; Tao, D. SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 2184–2190. [Google Scholar]
- Zaken, E.B.; Ravfogel, S.; Goldberg, Y. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv 2021, arXiv:2106.10199. [Google Scholar]
- Liu, X.; Sun, T.; Huang, X.; Qiu, X. Late Prompt Tuning: A Late Prompt Could Be Better Than Many Prompts. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 1325–1338. [Google Scholar]
- Zhu, W.; Tan, M. SPT: Learning to Selectively Insert Prompts for Better Prompt Tuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 11862–11878. [Google Scholar]
- Razdaibiedina, A.; Mao, Y.; Khabsa, M.; Lewis, M.; Hou, R.; Ba, J.; Almahairi, A. Residual Prompt Tuning: Improving prompt tuning with residual reparameterization. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 6740–6757. [Google Scholar]
- Xiao, Y.; Xu, L.; Li, J.; Lu, W.; Li, X. Decomposed Prompt Tuning via Low-Rank Reparameterization. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, 6–10 December 2023; pp. 13335–13347. [Google Scholar]
- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 353–355. [Google Scholar]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. Roberta: A robustly optimized bert pretraining approach. arXiv 2019, arXiv:1907.11692. [Google Scholar]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Aghajanyan, A.; Zettlemoyer, L.; Gupta, S. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 7319–7328. [Google Scholar]
- Zhang, A.; Tay, Y.; Zhang, S.; Chan, A.; Luu, A.T.; Hui, S.C.; Fu, J. Beyond fully-connected layers with quaternions: Parameterization of hypercomplex multiplications with 1/n parameters. arXiv 2021, arXiv:2102.08597. [Google Scholar]
- Peters, M.E.; Ruder, S.; Smith, N.A. To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, 2 August 2019; pp. 7–14. [Google Scholar]
- Vu, T.; Lester, B.; Constant, N.; Al-Rfou, R.; Cer, D. SPoT: Better Frozen Model Adaptation through Soft Prompt Transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5039–5059. [Google Scholar]
- Asai, A.; Salehi, M.; Peters, M.; Hajishirzi, H. ATTEMPT: Parameter-Efficient Multi-task Tuning via Attentional Mixtures of Soft Prompts. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 6655–6672. [Google Scholar]
- Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
- Gu, Y.; Han, X.; Liu, Z.; Huang, M. Ppt: Pre-trained prompt tuning for few-shot learning. arXiv 2021, arXiv:2109.04332. [Google Scholar]
- Edalati, A.; Tahaei, M.; Kobyzev, I.; Nia, V.P.; Clark, J.J.; Rezagholizadeh, M. Krona: Parameter efficient tuning with kronecker adapter. arXiv 2022, arXiv:2212.10650. [Google Scholar]
- Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. QLoRA: Efficient Finetuning of Quantized LLMs. In Proceedings of the Advances in Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023; pp. 10088–10115. [Google Scholar]
- Lv, K.; Yang, Y.; Liu, T.; Gao, Q.; Guo, Q.; Qiu, X. Full parameter fine-tuning for large language models with limited resources. arXiv 2023, arXiv:2306.09782. [Google Scholar]
- Nawrot, P.; Chorowski, J.; Łańcucki, A.; Ponti, E.M. Efficient transformers with dynamic token pooling. arXiv 2022, arXiv:2211.09761. [Google Scholar]
- Wiebe, J.; Wilson, T.; Cardie, C. Annotating expressions of opinions and emotions in language. Lang. Resour. Eval. 2005, 39, 165–210. [Google Scholar] [CrossRef]
- Pang, B.; Lee, L. Seeing Stars: Exploiting Class Relationships for Sentiment Categorization with Respect to Rating Scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), Ann Arbor, MI, USA, 25–30 June 2005; pp. 115–124. [Google Scholar]
- Pang, B.; Lee, L. A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), Barcelona, Spain, 21–26 July 2004; pp. 271–278. [Google Scholar]
- Voorhees, E.M.; Tice, D.M. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, 24–28 July 2000; pp. 200–207. [Google Scholar]
- Liu, X.; Ji, K.; Fu, Y.; Tam, W.; Du, Z.; Yang, Z.; Tang, J. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 61–68. [Google Scholar]
- Wu, Z.; Wang, S.; Gu, J.; Hou, R.; Dong, Y.; Vydiswaran, V.G.V.; Ma, H. IDPG: An Instance-Dependent Prompt Generation Method. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 5507–5521. [Google Scholar]
| Method | Tunable Parameters | Training Speed (Token/ms, ↑) | Memory Cost (GB, ↓) | Backprop | EI |
|---|---|---|---|---|---|
| Model Tuning | 355 M | 11.6 | 23.5 | no | 1.06 |
| Adapter | 1.6 M | 15.5 (1.3×) | 16.5 (29.8%) | no | 1.62 |
| AdapterDrop | 811 K | 21.6 (1.9×) | 9.5 (59.6%) | no | 2.07 |
| Prompt Tuning | 21 K | 16.9 (1.5×) | 17.8 (24.3%) | no | 0.51 |
| P-Tuning V2 | 985 K | 19.2 (1.7×) | 16.8 (28.5%) | no | 1.29 |
| S-IDPG-PHM | 114 K | 12.0 (1.0×) | 16.8 (28.5%) | no | 0.49 |
| BitFit | 273 K | 16.5 (1.4×) | 15.7 (33.2%) | no | 2.03 |
| LoRA | 788 K | 16.4 (1.4×) | 16.2 (31.1%) | no | 1.88 |
| LPT | 792 K | 23.2 (2.0×) | 15.5 (34.0%) | yes | 1.54 |
| LIPT | 931 K | 25.7 (2.2×) | 14.1 (40.0%) | yes | 2.15 |
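As an illustration of how the Efficiency Indicator of Section 3.4 could tie the columns of this table together, the sketch below combines normalized accuracy with a sigmoid-adjusted training cost. The composition, the equal weighting of the two cost terms, and the default ranges, k, and x0 are assumptions for illustration only, not the paper's exact formula; its absolute values need not match the EI column above, it is only meant to show the direction of each term.

```python
import math


def efficiency_indicator(acc, speed, params,
                         acc_range=(80.0, 95.0),      # assumed accuracy pool (min, max)
                         speed_range=(11.6, 25.7),    # tokens/ms over the compared methods
                         params_range=(21e3, 355e6),  # tunable-parameter pool (min, max)
                         k=10.0, x0=0.5):
    """Hypothetical EI composition following the outline of Section 3.4:
    normalized accuracy divided by a sigmoid-adjusted training cost.
    The paper's exact weights and ranges are not reproduced here."""
    # Accuracy normalization (higher is better).
    acc_n = (acc - acc_range[0]) / (acc_range[1] - acc_range[0])
    # Inverse normalization of training speed (higher speed -> lower cost).
    speed_cost = (speed_range[1] - speed) / (speed_range[1] - speed_range[0])
    # Base-10 logarithmic normalization of the parameter count.
    log_p, log_lo, log_hi = (math.log10(v) for v in (params, *params_range))
    param_cost = (log_p - log_lo) / (log_hi - log_lo)
    # Average the two cost terms (assumed equal weights) and squash with a
    # sigmoid of slope k centered at x0, as described in Section 3.4.4.
    raw_cost = 0.5 * (speed_cost + param_cost)
    adj_cost = 1.0 / (1.0 + math.exp(-k * (raw_cost - x0)))
    return acc_n / adj_cost
```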
| Category | Dataset | \|Train\| | \|Dev\| | \|Test\| | \|Y\| | Type | Labels |
|---|---|---|---|---|---|---|---|
| Single-sentence | SST-2 | 67,349 | 872 | 1821 | 2 | sentiment | positive, negative |
| Single-sentence | MPQA | 7606 | 1000 | 2000 | 2 | opinion polarity | positive, negative |
| Single-sentence | MR | 7662 | 1000 | 2000 | 2 | sentiment | positive, negative |
| Single-sentence | Subj | 7000 | 1000 | 2000 | 2 | subjectivity | subjective, objective |
| Single-sentence | TREC | 4952 | 500 | 500 | 6 | question cls. | abbr., entity, description, human, loc., num. |
| Sentence pair | MNLI | 392,702 | 19,647 | 19,643 | 3 | NLI | entailment, neutral, contradiction |
| Sentence pair | MRPC | 3668 | 408 | 1725 | 2 | paraphrase | equivalent, not equivalent |
| Sentence pair | QNLI | 104,743 | 5463 | 5463 | 2 | NLI | entailment, not entailment |
| Sentence pair | QQP | 363,846 | 40,430 | 390,965 | 2 | paraphrase | equivalent, not equivalent |
| Sentence pair | RTE | 2490 | 277 | 3000 | 2 | NLI | entailment, not entailment |
| Method | Tunable Parameters | SST-2 (acc) | MPQA (acc) | MR (acc) | Subj (acc) | TREC (acc) | MNLI (acc) | MRPC (acc and F1) | QNLI (acc) | QQP (acc and F1) | RTE (acc) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model Tuning ‡ | 355 M | 95.6 | 90.2 | 91.3 | 96.8 | 97.6 | 89.3 | 91.2 | 94.6 | 90.7 | 86.2 | 92.4 |
| Adapter ‡ | 1.6 M | 96.2 (0.2) | 89.2 (0.5) | 91.6 (0.4) | 96.8 (0.4) | 97.0 (0.3) | 90.5 (0.1) | 90.3 (1.0) | 94.7 (0.3) | 89.4 (0.7) | 85.5 (1.2) | 92.3 |
| AdapterDrop ‡ | 811 K | 95.3 (0.3) | 89.1 (0.7) | 91.0 (0.5) | 95.3 (0.6) | 95.7 (0.5) | 88.5 (0.2) | 90.1 (1.3) | 93.3 (0.3) | 88.3 (0.3) | 81.1 (2.0) | 90.8 |
| BitFit ‡ | 273 K | 95.9 (0.1) | 89.2 (0.9) | 91.8 (0.5) | 96.9 (0.1) | 96.2 (0.3) | 90.0 (0.1) | 89.6 (0.9) | 94.4 (0.2) | 87.9 (0.4) | 82.4 (1.1) | 91.4 |
| LoRA ‡ | 788 K | 96.2 (0.3) | 90.1 (0.3) | 92.0 (0.1) | 97.1 (0.4) | 96.8 (0.6) | 89.8 (0.3) | 91.1 (0.6) | 94.8 (0.2) | 89.8 (0.1) | 84.8 (2.1) | 92.3 |
| Prompt Tuning ‡ | 21 K | 94.9 (0.5) | 88.8 (0.8) | 89.6 (0.5) | 93.9 (0.6) | 86.4 (0.7) | 86.7 (0.9) | 75.7 (0.7) | 91.4 (0.1) | 81.2 (0.8) | 60.8 (0.5) | 84.9 |
| P-tuning v2 ‡ | 985 K | 95.8 (0.4) | 89.9 (0.6) | 91.4 (0.4) | 96.5 (0.2) | 95.8 (0.6) | 88.2 (0.2) | 86.5 (2.1) | 93.7 (0.3) | 85.3 (0.2) | 66.9 (2.3) | 89.0 |
| S-IDPG-PHM ‡ | 114 K | 94.8 (0.3) | 89.5 (0.6) | 90.8 (0.5) | 95.9 (0.6) | 89.3 (0.4) | 87.4 (0.5) | 77.3 (1.2) | 91.2 (0.4) | 82.3 (1.9) | 62.7 (1.9) | 86.1 |
| LPT | 792 K | 95.26 (0.3) | 90.90 (0.2) | 91.07 (0.6) | 96.93 (0.4) | 89.53 (1.9) | 86.01 (0.2) | 84.34 (1.0) | 90.99 (0.4) | 83.50 (0.1) | 77.86 (0.6) | 88.6 |
| LIPT | 931 K | 95.07 (0.3) | 91.03 (0.1) | 91.00 (0.5) | 97.73 (0.4) | 94.00 (0.9) | 86.88 (0.2) | 87.68 (1.3) | 91.06 (0.2) | 85.25 (0.1) | 79.78 (1.3) | 90.0 |
| Method | Tunable Parameters | Subj (acc) | TREC (acc) | MRPC (acc and F1) | RTE (acc) | Avg |
|---|---|---|---|---|---|---|
| Model Tuning | 774 M | 97.2 | 97.0 | 88.0 | 75.8 | 89.5 |
| Prompt Tuning | 26 K | 88.8 (1.0) | 82.7 (1.1) | 75.1 (0.5) | 53.7 (1.3) | 75.1 |
| LPT | 990 K | 96.67 (0.1) | 91.67 (0.3) | 80.86 (0.6) | 68.59 (0.6) | 84.45 |
| LIPT | 1.2 M | 96.83 (0.2) | 90.73 (0.2) | 80.63 (1.0) | 68.71 (1.8) | 84.23 |
| Method | Tunable Parameters | SST-2 (acc) | MPQA (acc) | MR (acc) | Subj (acc) | TREC (acc) | MNLI (acc) | MRPC (acc and F1) | QNLI (acc) | QQP (acc and F1) | RTE (acc) | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Model Tuning | 355 M | 91.4 (0.8) | 87.2 (1.1) | 89.4 (0.6) | 95.1 (0.4) | 95.4 (0.5) | 75.3 (2.1) | 85.1 (1.8) | 85.2 (0.9) | 77.3 (1.2) | 67.0 (7.7) | 84.8 |
| Prompt Tuning | 21 K | 91.1 (1.5) | 74.7 (5.1) | 88.3 (0.6) | 86.4 (0.4) | 81.7 (2.4) | 45.5 (1.5) | 74.6 (0.3) | 58.1 (1.6) | 52.6 (5.8) | 61.2 (1.7) | 71.4 |
| P-tuning v2 | 985 K | 91.3 (0.3) | 85.1 (1.6) | 88.0 (1.5) | 94.5 (0.4) | 94.6 (0.8) | 61.6 (2.7) | 76.6 (1.8) | 73.7 (2.4) | 71.7 (1.8) | 56.0 (1.1) | 79.3 |
| S-IDPG-PHM | 114 K | 91.3 (0.5) | 75.9 (3.8) | 88.7 (0.4) | 87.2 (0.6) | 84.7 (2.1) | 46.3 (1.1) | 75.1 (0.8) | 59.4 (0.7) | 56.4 (3.0) | 64.7 (1.7) | 73.0 |
| LPT | 792 K | 91.70 (0.8) | 89.94 (1.2) | 89.42 (0.6) | 93.65 (1.1) | 83.61 (2.6) | 69.44 (4.3) | 79.16 (2.0) | 79.32 (1.9) | 76.25 (1.0) | 72.28 (1.5) | 82.48 |
| LIPT | 931 K | 91.18 (0.2) | 89.16 (0.5) | 88.96 (1.1) | 94.21 (1.1) | 85.69 (1.1) | 71.76 (1.7) | 78.79 (1.3) | 76.44 (1.1) | 74.90 (1.2) | 74.32 (2.2) | 82.54 |
RoBERTa-large

| Method | Subj | TREC | MRPC | RTE | Avg |
|---|---|---|---|---|---|
| LXPT | 97.50 (0.6) | 93.60 (0.5) | 86.91 (1.2) | 78.34 (0.4) | 89.10 |
| LIPT | 97.73 (0.4) | 94.00 (0.9) | 87.68 (1.3) | 79.78 (1.3) | 89.80 |

GPT2-large

| Method | Subj | TREC | MRPC | RTE | Avg |
|---|---|---|---|---|---|
| LXPT | 96.80 (0.1) | 89.13 (0.9) | 80.29 (1.2) | 67.63 (3.8) | 83.46 |
| LIPT | 96.83 (0.2) | 90.73 (0.2) | 80.63 (1.0) | 68.71 (1.8) | 84.23 |