4.1 Data Augmentation
One common way to deal with the lack of data is
data augmentation, which consists of increasing the size of the available dataset with new samples generated by means of heuristics or external data sources. Augmentation methods explored in the current literature for NLP tasks usually manipulate words in the original sentence via word replacement [
13], random deletion [
160], word position swap [
93], and generative models [
171]. Applying these transformations directly to NER input samples is not straightforward, since the token-level classification implied by this task means that each manipulation impacts the labels. Thus, data augmentation techniques for NER are comparatively less studied [
25]. In the following, we describe current methods applying data augmentation in few-shot scenarios. We provide an example of an augmented sample for each method in Figure
11.
Data Boost [
85] explores the text generation ability of Language Models to generate augmented samples. In particular, GPT-2 [
113] is used as a conditional generator and guided toward specific class labels by means of a
Reinforcement Learning (
RL) approach. An RL stage is added between the softmax and argmax functions of the conditional generator. The
state at step
\(t\) is the generated sentence before
\(t\), i.e., \(s_t = \mathbf {x}_{\lt t}\), where
\(\mathbf {x}_{\lt t} = \lbrace x_0, x_1, \dots , x_{t-1}\rbrace\); the
policy \(\pi _\theta\) is the probability that token
\(x_t\) is chosen (
action \(a_t\)), i.e., the softmax output of the hidden states
\(\pi _\theta (a_t|s_t) = softmax(h^\theta _{\lt t})\).
The reward, following
Proximal Policy Optimization (
PPO) [
130], for a given conditional token \(x_t^c\), is computed as
\[ R(x_t^c) = G(x_t^c) - \beta \, \mathrm{KL}(\theta _c \Vert \theta), \]
where
\(G(x_t^c)\) is the
salience gain, which quantifies the resemblance of the generated token to the target label lexicon via a logarithmic summation of its cosine similarities with each word in the salient lexicon; KL is the
Kullback-Leibler (
KL) divergence between conditional
\(\theta _c\) and unconditional
\(\theta\) distributions;
\(\beta\) is a weighting parameter. However, this method was conceived for text data augmentation and does not directly apply to NER, as it does not provide a mapping of the entity mentions from the original to the augmented sentence.
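To make the reward concrete, the following sketch computes it from precomputed word vectors and token distributions; the function names and the sum-of-logs reading of the "logarithmic summation" are our assumptions, not the original implementation.

```python
import numpy as np

def salience_gain(token_vec, lexicon_vecs):
    """Sum of log cosine similarities between the generated token and each
    word in the target label's salient lexicon (one reading of the paper's
    'logarithmic summation')."""
    sims = [
        np.dot(token_vec, w) / (np.linalg.norm(token_vec) * np.linalg.norm(w))
        for w in lexicon_vecs
    ]
    # clip to keep the logarithm defined for non-positive similarities
    return float(np.sum(np.log(np.clip(sims, 1e-8, None))))

def kl_divergence(p_cond, p_uncond):
    """KL divergence between conditional and unconditional token distributions."""
    p_cond = np.clip(p_cond, 1e-12, 1.0)
    p_uncond = np.clip(p_uncond, 1e-12, 1.0)
    return float(np.sum(p_cond * np.log(p_cond / p_uncond)))

def reward(token_vec, lexicon_vecs, p_cond, p_uncond, beta=0.1):
    # PPO-style reward: salience gain minus a KL penalty that keeps the
    # conditional generator close to the unconditional language model
    return salience_gain(token_vec, lexicon_vecs) - beta * kl_divergence(p_cond, p_uncond)
```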
Counterfactual Generator [
176] addresses the poor generalization ability of few-shot systems in the presence of spurious correlations between entities and their contexts. For example, in the sentence “John lives in New York”, the entity “New York” and its context (“John lives in”) are highly correlated, but it is not true that one
causes the other. Thus, based on the idea that entities and contexts are not in a causal relationship, Zeng et al. [
176] provide a framework to generate weakly-labeled counterfactual examples in a few steps: (1) a vocabulary with the desired entity type is prepared, (2) an entity is randomly selected from the input sentence and (3) replaced with a different entity from the entity set to form a counterfactual example; finally, (4) a discriminator model is trained to distinguish good examples from counterfactuals: if the discriminator, i.e., a NER model, can correctly recognize the replaced entity, the counterfactual example is included in the new training set.
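A minimal sketch of the replacement steps (2)–(3) is given below, assuming BIO-tagged inputs and a per-type entity vocabulary; the discriminator interface is left abstract.

```python
import random

def make_counterfactual(tokens, labels, entity_vocab):
    """Replace one randomly chosen entity mention with another entity
    of the same type to form a weakly-labeled counterfactual example."""
    # collect (start, end, type) spans of entity mentions from BIO labels
    spans = []
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            j = i + 1
            while j < len(labels) and labels[j] == "I-" + lab[2:]:
                j += 1
            spans.append((i, j, lab[2:]))
    if not spans:
        return None
    start, end, etype = random.choice(spans)
    replacement = random.choice(entity_vocab[etype]).split()
    new_tokens = tokens[:start] + replacement + tokens[end:]
    new_labels = (labels[:start]
                  + ["B-" + etype] + ["I-" + etype] * (len(replacement) - 1)
                  + labels[end:])
    return new_tokens, new_labels

# step (4): a counterfactual is kept only if a discriminator NER model
# still recognizes the replaced span (interface assumed), e.g.:
# if discriminator_recognizes(new_tokens, new_labels): train_set.append(...)
```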
Dai et al. [
25] investigate the adaptation of several simple data augmentation methods to NER problems. They are listed as follows (an illustrative sketch is given after the list):
—
Label-wise Token Replacement (LwTR): each token in the input sentence is randomly selected for replacement; if selected, it is replaced with another available token carrying the same label.
—
Synonym Replacement (SR): similar to LwTR, but tokens are replaced with synonyms retrieved from WordNet.
—
Mention Replacement (MR): similar to LwTR, but applied only to mentions, which are randomly replaced with other mentions of the same type from the original training set.
—
Shuffle within Segments (SiS): sentences are split into segments with the same label, and tokens are randomly shuffled within these segments.
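The sketch below illustrates two of these operations on (token, label) sequences; SR and MR follow the same pattern, drawing replacements from WordNet synonyms and same-type mentions, respectively. The names and the replacement probability are illustrative.

```python
import random

def label_wise_token_replacement(tokens, labels, label_to_tokens, p=0.3):
    """LwTR: each token is replaced, with probability p, by another
    training-set token carrying the same label."""
    out = []
    for tok, lab in zip(tokens, labels):
        if random.random() < p and label_to_tokens.get(lab):
            tok = random.choice(label_to_tokens[lab])
        out.append(tok)
    return out, list(labels)

def shuffle_within_segments(tokens, labels):
    """SiS: split the sentence into maximal same-label segments and
    shuffle the tokens inside each segment."""
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and labels[j] == labels[i]:
            j += 1
        segment = tokens[i:j]
        random.shuffle(segment)
        out.extend(segment)
        i = j
    return out, list(labels)
```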
Experimental results show that all the investigated data augmentation techniques are effective in few-shot settings (e.g., 50- and 100-shot scenarios), but they often fail on full datasets due to the degrading effects of noise.
COSINER [
7] is a more sophisticated variation of
Mention Replacement (
MR) [
25]. It employs cosine similarity to replace mentions with the most similar ones, rather than replacing them at random. Similarity is computed between concept embeddings
\(V_{concept}\) extracted from each mention in the training set, which combine mention and context information, with
\(V_{context}\) being the embedding extracted from a transformer network used as a feature extractor. The similarity-based nature of this method yields performance improvements over techniques based on random replacements.
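Assuming the concept embeddings are already available, the core replacement step can be sketched as follows (illustrative names; the exact formula for \(V_{concept}\) is given in the original paper):

```python
import numpy as np

def most_similar_mention(mention_vec, candidate_vecs, candidate_mentions):
    """Replace a mention with the most similar same-type candidate,
    using cosine similarity between concept embeddings."""
    norms = np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(mention_vec)
    sims = candidate_vecs @ mention_vec / np.clip(norms, 1e-12, None)
    return candidate_mentions[int(np.argmax(sims))]
```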
StyleNER [
18] learns patterns (e.g., style, noise, abbreviations) that differentiate data from high-resource and low-resource domains, and a shared space where they are aligned. The key idea is that text data may differ in textual patterns (e.g., long and complex sentences from research abstracts versus social media data), but their semantics are transferable. Given a source domain
\(D_{src}\) and a target domain
\(D_{tgt}\) dataset, sentences are linearized by inserting entity labels before their corresponding text span, and a random pair of sentences is then extracted from the two datasets and provided as input to the model. The model works in two steps:
—
Denoising reconstruction: the model learns input embeddings from their corresponding domain. Inputs are perturbed in several ways (e.g., shuffling, dropping, masking) to inject noise, and the model has to capture semantics and learn patterns that differentiate sentences across domains. Then, a decoder reconstructs noisy sentences to their corresponding domain.
—
Detransforming reconstruction: based on their semantics, sentences are transformed from one domain to the other. Then, the model generates embeddings for transformed sentences and learns to reconstruct each of them in their corresponding domain.
A discriminator is also trained to distinguish the domain of an embedded sentence, which allows determining whether the encoder can generate meaningful representations or whether the model has to bypass the intermediate mapping step between domains.
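A minimal sketch of the noise-injection step used for denoising reconstruction is shown below; the perturbation probabilities and the local-shuffling scheme are our assumptions.

```python
import random

def perturb(tokens, p_drop=0.1, p_mask=0.1, shuffle_window=3):
    """Noise injection for denoising reconstruction: drop, mask, and
    locally shuffle tokens so the encoder must rely on semantics."""
    noisy = [t for t in tokens if random.random() >= p_drop]              # dropping
    noisy = ["[MASK]" if random.random() < p_mask else t for t in noisy]  # masking
    # local shuffling: each token moves at most shuffle_window - 1 positions
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(noisy))]
    return [t for _, t in sorted(zip(keys, noisy))]
```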
4.1.1 Performance Evaluation.
The experimental results summarized in Table
3 offer empirical proof to back up the claim that using data augmentation methods consistently results in performance gains. Moreover, the data suggest that limited contexts, such as those encountered in few-shot scenarios, can benefit even more from data augmentation. It is worth noting, however, that comparing the effectiveness of various data augmentation techniques across different datasets is difficult. Nevertheless, the results obtained on OntoNotes
by ProtoNER [
39] presented in Table
2 suggest that meta-learning outperforms data augmentation techniques, achieving better performance with only 20 training samples. This finding suggests that while data augmentation alone may be less effective than model-centric approaches, combining data augmentation with methods specially designed for few-shot learning may lead to even better performance.
4.2 Distant Supervision
In few-shot scenarios, labeled data can be retrieved from heuristics, different domains or languages, external knowledge bases, or ontologies.
Distant supervision [
94] aims at leveraging such resources to heuristically annotate training data. For example, in biomedicine many curated resources are available:
NCBO Bioportal [
161] houses 541 biomedical ontologies,
Medical Subject Headings (MeSH) is a controlled vocabulary with 347,692 classes of medical items, among others. Combining ontologies is a difficult task due to their heterogeneous structures, differing concept granularities, and overlaps or conflicts between entity definitions. Generally, the main steps of distant supervision are (1)
candidate generation, i.e., the identification of potential entities, and (2)
labeling heuristics to generate noisy labels, as shown in the example of Figure
12. The use of distant supervision for FS-NER in low-resource languages has yet to be deeply explored. The amount of external information available in low-resource settings might be very limited: for example, the Wikipedia knowledge graph contains 4 million person names in English while only 32 thousand in Yorùbá [
1]. Furthermore, without further tuning under better supervision, distantly supervised models have low recall [
14]. In the following, we describe methods that apply distant supervision to NER tasks.
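Before turning to specific methods, a minimal sketch of the generic two-step pipeline (candidate generation plus labeling heuristics) is given below; the longest-match strategy is an illustrative choice, and real systems typically combine several such heuristics.

```python
def distant_labels(tokens, lexicon):
    """Heuristic annotation: longest-match dictionary lookup producing
    noisy BIO labels (step 1: candidate generation; step 2: labeling)."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for j in range(len(tokens), i, -1):           # longest match first
            etype = lexicon.get(" ".join(tokens[i:j]).lower())
            if etype:
                labels[i] = "B-" + etype
                for k in range(i + 1, j):
                    labels[k] = "I-" + etype
                i, matched = j, True
                break
        if not matched:
            i += 1
    return labels
```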
SwellShark [
38] aims at automating the generation of candidates and noisy labels without hand-labeled data. Its inputs are (1) a collection of unlabeled documents and (2) some form of weak supervision, typically ontologies and heuristic rules. The candidate generator
\(\Gamma _\Theta\) is defined as the mapping function from a document collection
\(D\) into a candidate set
\(\Gamma _\Theta : D \rightarrow \lbrace x_1, x_2, \dots , x_N\rbrace\), where each candidate
\(x_i\) is a character-level span within the document. Candidates in SwellShark are determined by heuristics, e.g., matching a dictionary of noun phrases with regular expressions. To filter actual entities from candidates, a
labeling function generator \(\Gamma _\lambda\) is defined as a function which receives a resource
\(R\) for weak supervision (e.g., ontology, term frequencies) and generates labeling functions
\(\Gamma _\lambda : R \rightarrow \lbrace \lambda _1, \lambda _2, \dots , \lambda _N\rbrace\) (e.g., lexicon matching, frequency-based thresholds).
Choosing heuristics intrinsically involves a tradeoff between development time and performance: greedy heuristics such as dictionary matching often imply low recall; noun-phrase candidates generate large sets, which usually increases noise; and hand-tuned heuristics may indeed result in high performance, but require more effort.
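The sketch below illustrates the two components \(\Gamma _\Theta\) and \(\Gamma _\lambda\); the regular expression and the voting convention are illustrative assumptions.

```python
import re

def candidate_generator(documents, pattern=r"[A-Za-z][A-Za-z-]+(?: [A-Za-z-]+){0,3}"):
    """Gamma_Theta: map a document collection to character-level candidate spans."""
    for doc in documents:
        for m in re.finditer(pattern, doc):
            yield doc, m.start(), m.end()

def make_labeling_functions(ontology_terms, low_freq_terms):
    """Gamma_lambda: turn weak-supervision resources into labeling functions
    that vote positive (1), negative (-1), or abstain (0) on each candidate."""
    def lf_lexicon(doc, start, end):
        return 1 if doc[start:end].lower() in ontology_terms else 0
    def lf_frequency(doc, start, end):
        return -1 if doc[start:end].lower() in low_freq_terms else 0
    return [lf_lexicon, lf_frequency]
```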
AutoNER [
134] handles the problems of
incompleteness and
noise in automatically-labeled NER data, which characterize methods based on the detection of entity spans with heuristic rules, such as regular expressions [
38,
120] and exact string matching [
43,
51]. In these methods, unmatched tokens are simply ignored when entities are not covered by the ontology in use, introducing many false-negative labels (incompleteness). Furthermore, they often require expensive expert effort to cover many special cases. To handle these problems, AutoNER marks some high-quality out-of-dictionary phrases as “potential entities” without requiring human effort. In particular, to leverage the information embedded in dictionaries, AutoNER proposes the
tie or break tagging scheme, which tags two adjacent tokens as (1)
tie if they belong to the same entity type, (2)
unknown if at least one of them belongs to an unknown-type phrase, and (3)
break otherwise. This scheme is used for entity span detection, while entity types are then identified based on feature vectors of the candidates. By predicting whether two adjacent tokens refer to the same entity, AutoNER builds more robust distantly supervised models, since distant supervision is often noisy on the boundaries of entity mentions but not on their inner ties.
One of the main advantages of AutoNER is that entity mentions may be marked as
unknown, allowing the inclusion of tokens whose types cannot be identified through distant supervision. For example, “prostaglandin synthesis” may be present in both disease and chemical lexicons, or the lexicon may not cover all the possible entity types. AutoPhrase [
133] is used to automatically mine new phrases from unlabeled text and a dictionary of high-quality phrases. All the out-of-dictionary phrases are then labeled as
unknown and added to the dictionary.
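A minimal sketch of the tie-or-break tagging is shown below, assuming the dictionary and AutoPhrase matches have already been resolved to pairs of adjacent token indices.

```python
def tie_or_break(tokens, dictionary_pairs, unknown_pairs):
    """Tag each adjacent token pair as 'tie', 'unknown', or 'break'.
    dictionary_pairs / unknown_pairs: sets of index pairs (i, i + 1) falling
    inside known-type and unknown-type phrases, respectively."""
    tags = []
    for i in range(len(tokens) - 1):
        if (i, i + 1) in unknown_pairs:
            tags.append("unknown")     # at least one token in an unknown-type phrase
        elif (i, i + 1) in dictionary_pairs:
            tags.append("tie")         # both tokens inside the same entity
        else:
            tags.append("break")
    return tags
```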
Yang et al. [
167] also address the
incomplete and
noisy annotation problems. Weak sentences are obtained by using a dictionary
\(D\) built from named entities available in the training dataset
\(H\) to weakly annotate a large unlabeled pool of sentences. Note that in this work “distant” resources are used to augment the training set, but the dictionary is built by relying on the available data only, which may be a limitation in highly-constrained few-shot settings, where training data do not cover all the possible named entities. To handle the
incomplete annotation problem, sentences are allowed to have partial annotations: some token spans are annotated with definite labels, while all the others are associated with all the possible labels (a sketch follows this paragraph). The
noisy annotations problem is handled with a reinforcement learning approach which follows Feng et al. [
34] to obtain clean instances from distantly supervised NER data. In particular, the state
\(\mathbf {S_t}\) is a vector containing: (1) representation of the current sentence obtained with a BiLSTM layer, (2) label scores computed with a MLP layer from the shared encoder, and (3) distant annotation of the instance. The action
\(a_t \in \lbrace 0,1\rbrace\) indicates whether to select the
\(t\)th distantly supervised sentence, and the policy is learned by optimizing NER performance (reward).
Knowledge-Augmented Language Model (KALM) [
80] augments a traditional language model with a knowledge base without requiring any additional component; in addition, it learns to recognize entities in an entirely unsupervised way by using entity type information which is latent in the model. In particular, KALM has the ability to predict masked words from a vocabulary
\(V_g\) like any other language model, but it has a separate vocabulary
\(V_i\) for each entity type and is able to predict whether to expect an entity from the context. Formally, given a latent variable
\(\tau _i\) denoting the entity type
\(i\) and previously observed words
\(c_t = [y_1, y_2, \dots , y_{t-1}, y_t]\), the probability of the next word is computed as follows:
\[ P(y_{t+1} \mid c_t) = \sum _j P(y_{t+1} \mid \tau _{t+1} = j, c_t) \, P(\tau _{t+1} = j \mid c_t), \]
where
\(P(y_{t+1} | \tau _{t+1}, c_t)\) is the distribution of entity words of type
\(\tau _{t+1}\) and
\(P(\tau _{t+1}=j | c_t)\) is the probability that the next word has a given type
\(j\). Both are computed as in standard language models, i.e., by projecting the hidden state of an LSTM model and normalizing with a softmax. However, the base model is enhanced by using as input not only the embedding vector of the input word, but also the embedding of the type of the previous word: in this way, KALM models context while taking account of entity types, which allows latent types to be learned more accurately during training.
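The marginalization over latent types can be sketched as follows; for simplicity, we assume all type vocabularies are projected onto a common index space.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_probability(h_t, W_types, W_words_per_type):
    """Marginalize over latent entity types:
    P(y | c_t) = sum_j P(tau = j | c_t) * P(y | tau = j, c_t).
    h_t: LSTM hidden state; W_types: type projection;
    W_words_per_type[j]: projection onto type j's vocabulary."""
    p_type = softmax(W_types @ h_t)
    vocab_size = W_words_per_type[0].shape[0]
    p_word = np.zeros(vocab_size)
    # each type contributes its own word distribution, weighted by p_type
    for j, W_j in enumerate(W_words_per_type):
        p_word += p_type[j] * softmax(W_j @ h_t)
    return p_word
```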
Cao et al. [
14] try to maximize the potential of automatically weakly labeled data (e.g., anchors from Wikipedia) by dealing with the
incomplete and
noisy annotations problems. To obtain a high recall, the framework generates as many weakly-labeled data as possible with a
label induction approach assigning labels to words based on Wikipedia anchors and taxonomy, and a
data selection scheme that computes a scoring function to distinguish high-quality data among the weak sentences; a neural model is then trained on such data. For data selection, the approach of Ni et al. [
102], which is based on annotation confidence and coverage, is applied. For sequence labeling from high-quality weakly-labeled data, Partial CRFs [
146] are employed, while a
classification module regards name tagging from noisy weakly-labeled data as a multi-label classification problem, predicting each word label separately. The first layers of the two networks are shared so as to allow knowledge distillation.
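One possible form of the confidence-and-coverage scoring used for data selection is sketched below; the exact scoring function is an assumption on our part.

```python
def sentence_quality(weak_labels, confidences):
    """Data selection sketch: score weak sentences by annotation confidence
    and coverage (the multiplicative combination is a hypothetical choice)."""
    labeled = [c for lab, c in zip(weak_labels, confidences) if lab != "O"]
    if not labeled:
        return 0.0
    confidence = sum(labeled) / len(labeled)       # avg confidence on entity tokens
    coverage = len(labeled) / len(weak_labels)     # fraction of labeled tokens
    return confidence * coverage
```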
Graph Attention Model [
87] uses a domain-specific dictionary to create a
word formation graph, which captures variants of entities and thus discovers as many new entity mentions as possible. In particular, the vertex set contains all words in the dictionary, which are connected by undirected edges if they appear in the same entity type. Mention candidates are then extracted with a graph-matching algorithm. After extracting all the candidates, a
word-mention graph integrates the word formation of candidate entities into their sentences: the vertex set contains the words in the input sentence and the extracted mention candidates. In this way, word-word links capture contextual information, while word-mention links capture the semantics of mentions. Graph information is then leveraged by a learning model including (1) a word embedding layer to represent words, (2) a BiLSTM layer to capture contextual information, and (3) a
graph attention network (
GAT) [
151] to incorporate the information of mention candidates.
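One plausible reading of the word formation graph construction is sketched below; the exact edge criterion is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def word_formation_graph(dictionary):
    """Vertices are all dictionary words; two words are linked by an undirected
    edge if they appear in entities of the same type (assumed edge criterion)."""
    words_by_type = defaultdict(set)
    for entity, etype in dictionary:               # e.g., ("heart failure", "Disease")
        words_by_type[etype].update(entity.split())
    adj = defaultdict(set)
    for words in words_by_type.values():
        for u, v in combinations(sorted(words), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj
```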
Linked HMM [
123] claims that the standard approach of generating candidate spans and then independently labeling each candidate, as in SwellShark [
38], limits the applicability to tasks for which candidate generators exist and increases human effort. Furthermore, candidate generators should identify all the possible entity spans, since errors propagate through the pipeline. Hence, Linked HMM allows users to write multiple rules that provide partial tags to sequences, whose accuracy is estimated by using an identifiable probabilistic generative model
without labeled training data. The estimated posterior distribution over the true tags is then used to train a sequence tagger. The rules that can be provided fall into two categories: (1)
tagging rules, which vote on the correct tags of sequence elements (they take a training input sequence and output a sequence of the same length indicating their votes on the true tags); (2)
linking rules, which vote on whether an adjacent element should have the same or a different tag (they allow distant supervision to propagate along sequences in a user-controlled way). These two types of rules are then used by a
linked Hidden Markov Model to estimate the true tags for training by reconciling incomplete and conflicting information from multiple rules provided by users.
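The two rule types can be sketched as simple functions over token sequences; the concrete rules below (lexicon votes and hyphen linking) are illustrative examples only.

```python
def tagging_rule_lexicon(tokens, lexicon):
    """Tagging rule: vote on the tag of each element (None = abstain).
    Outputs a sequence of the same length as the input."""
    return [lexicon.get(tok.lower()) for tok in tokens]

def linking_rule_hyphen(tokens):
    """Linking rule: vote on whether adjacent elements share a tag
    (True = same tag). Here, a '-' token links its neighbours."""
    return [tokens[i] == "-" or tokens[i + 1] == "-" for i in range(len(tokens) - 1)]
```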
Consensus Network [
67] can be trained on imperfect annotations from multiple sources (e.g., crowd annotations, cross-domain data) by learning representations for each source and dynamically aggregating them by a context-aware attention mechanism. This is based on the intuition that different sources of supervision may have different strengths based on scenarios where they are applied. The framework first uses a multi-task learning schema based on a BiLSTM-CRF network to
decouple model parameters into a set of annotator-invariant model parameters and a set of annotator-specific representations, then trains a context-aware attention module for a consensus-based representation by combining predictions on the target data. Specifically, scores from the annotator
\(k\) are obtained by combining the emission and transition score matrices from the BiLSTM-CRF network with an annotator-dependent matrix
\(\mathbf {A}^{(k)}\) which represents the pattern of annotation bias, i.e., the entry
\(\mathbf {A}^{(k)}_{ij}\) is the probability of assigning the wrong label
\(j\) instead of the correct label
\(i\). Scores from different annotators are then combined with weighted voting, where the weights are given by F1 scores on the training set; an attention module is added to improve generalization, since it gives more weight to the sources that are more related to the input sentence.
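The consensus aggregation can be sketched as a weighted vote over per-annotator score matrices, as below; the interaction with the attention module is simplified.

```python
import numpy as np

def consensus_scores(annotator_scores, f1_weights, attention_weights=None):
    """Combine per-annotator label score matrices (num_tokens x num_labels)
    by weighted voting; weights default to training-set F1 scores, optionally
    refined by context-aware attention weights."""
    w = np.asarray(f1_weights, dtype=float)
    if attention_weights is not None:
        w = w * np.asarray(attention_weights)
    w = w / w.sum()
    return sum(wk * s for wk, s in zip(w, annotator_scores))
```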
4.2.1 Performance Evaluation.
The experimental findings summarized in Table
4 demonstrate that distant supervision techniques can yield results that are comparable in quality to gold labeling at a significantly lower cost. The development of annotation functions, heuristics, dictionaries, or rules can serve as a cost-effective alternative to time-consuming and expensive labeling processes, even though some human effort may still be required. Moreover, combining weak labels with dictionaries has been found to be more effective than relying on partial training data or techniques such as data augmentation, as can be seen by comparing results obtained on biomedical datasets (i.e., NCBI-Disease [
32] and BC5CDR [
77]) with those presented in Tables
3,
5. Furthermore, the results obtained using the KALM method [
80] for CoNLL [
125] demonstrate its superiority over transfer-learning methods (refer to Table
1) that rely on a limited number of labeled samples. In fact, KALM [
80] yields outcomes that are almost as accurate as those generated using gold labels as shown in Table
5, hence highlighting the efficiency of weak labeling methods in minimizing labeling costs while retaining near-optimal quality.
4.4 Self-training
Self-training, also referred to as
self-learning, is an approach similar to distant supervision, where the model is trained on examples labeled by the model itself. As shown in Figure
14, the difference is that the labeling heuristic is replaced by the model itself. Originally proposed by Scudder [
131], it is one of the earliest semi-supervised methods. In the NLP field, it has been successfully applied to neural machine translation [
50] and sentence classification [
99]. Zoph et al. [
183] show that self-training yields performance improvements in both high- and low-data scenarios, whereas pre-training can even hurt performance when stronger data augmentation is used. In the following, we explore self-training approaches for FS-NER.
LM-LSTM-CRF [
84] leverages the knowledge obtained by a language model trained in an unsupervised fashion to improve sequence-labeling performance. It exploits word-level and character-level information of input samples in a co-training fashion, i.e., each input is represented by different sets of features, each providing complementary information. The language model and the sequence-labeling model share the same character-level layer, which consists of two LSTM units learning character-level information in a completely unsupervised way to capture the style and structure of input texts. However, since the two tasks are not strongly related, empirical results show that naive sharing can hurt the overall performance. Hence, the outputs of the character-level layer are transformed into task-specific spaces, so that the language model can indirectly provide its knowledge to the NER model through the shared layer, without sharing its feature space.
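The shared character-level layer with task-specific projections can be sketched in PyTorch as follows (dimensions and names are illustrative):

```python
import torch.nn as nn

class SharedCharLayer(nn.Module):
    """Character-level BiLSTM shared between the language model and the
    sequence labeler, with task-specific projections so the two tasks
    do not share a feature space directly."""
    def __init__(self, char_vocab, char_dim=30, hidden=50):
        super().__init__()
        self.embed = nn.Embedding(char_vocab, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.to_lm = nn.Linear(2 * hidden, 2 * hidden)    # language-model space
        self.to_ner = nn.Linear(2 * hidden, 2 * hidden)   # sequence-labeling space

    def forward(self, char_ids):
        h, _ = self.lstm(self.embed(char_ids))
        return self.to_lm(h), self.to_ner(h)
```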
Chen et al. [
15], considering that the performance of self-training highly depends on how new data are selected, use reinforcement learning to learn how to select instances from an unlabeled pool to be added to the training set. The NER model, previously trained on a few samples, classifies the newly selected samples and is retrained. The self-training approach is treated as a decision process, described as a function that receives the self-labeled instances as inputs and outputs the acceptance or rejection of such instances into the training set. A
Deep Q-Network (
DQN) [
96] then learns the selection strategy based on performance improvements on a development set. The Q-function is implemented using a three-layer neural network that receives the state
\(s = (h_s, h_c, h_p, h_t)\) of the learning framework as input, where
\(h_s\) is a representation of the input sample,
\(h_c\) is the confidence of tagging the instance using the model,
\(h_p\) is the marginals of the prediction,
\(h_t\) is the hidden representation from the model. The Q-function
\(Q^\pi (s,a)\rightarrow R\) receives the state
\(s\) and an action
\(a\) as inputs and returns a reward as a result of the execution of
\(a\). The policy
\(\pi\) aims at maximizing the reward of actions. In this work, the reward is defined as the difference in NER performance when adding a new sample (i.e., the action) to the training dataset. This setting is similar to that of Fang et al. [
33], who apply DQNs to find the best Active Learning policy.
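A minimal PyTorch sketch of the three-layer Q-function is given below; the hidden size is an assumption.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Three-layer Q-function: maps the state s = (h_s, h_c, h_p, h_t),
    concatenated into one vector, to Q-values for the two actions
    (reject / accept the self-labeled instance)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),    # Q(s, a) for a in {0, 1}
        )

    def forward(self, state):
        return self.net(state)

# the reward for accepting an instance is the change in NER performance
# on a development set after retraining with that instance
```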
RDANER [
174] uses a bootstrapping approach to obtain model predictions on easily-obtainable unlabeled data and retrain the model with the augmented weak dataset. Firstly, a general domain pre-trained language model
\(\mathcal {M}\) is fine-tuned on the few-shot labeled corpus available
\(\mathcal {D}^L\). Then,
\(\mathcal {M}\) is used to annotate an unsupervised dataset
\(\mathcal {D}^U\) to obtain a weakly annotated dataset
\(\mathcal {D}_{weak}^U\) which is then combined with
\(\mathcal {D}^L\) to get the augmented corpus. A threshold
\(\theta\) is used to filter out tags with low probabilities assigned by
\(\mathcal {M}\). This process is iterated until an acceptable level of accuracy is achieved or the maximum number of iterations is reached. Experiments show that this simple approach obtains worse F1-scores than domain-specific pre-trained language models (e.g., BioBERT [
69], SciBERT [
8]), but it could be a good alternative in low-resource languages and domains where huge domain-specific language model variants are not available.
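The bootstrapping loop can be sketched as follows; the model interface (fit, predict_with_confidence, evaluate) is a hypothetical abstraction, not RDANER's actual API.

```python
def rdaner_loop(model, labeled, unlabeled, theta=0.9, max_iter=5, target_f1=0.85):
    """Bootstrapping sketch: fine-tune, weakly annotate, filter tags by the
    confidence threshold theta, and retrain on the augmented corpus."""
    train_set = list(labeled)
    for _ in range(max_iter):
        model.fit(train_set)                         # fine-tune on current corpus
        weak = []
        for sentence in unlabeled:
            tags, probs = model.predict_with_confidence(sentence)
            kept = [(t if p >= theta else "O") for t, p in zip(tags, probs)]
            weak.append((sentence, kept))
        train_set = list(labeled) + weak
        if model.evaluate(labeled) >= target_f1:     # proxy stopping criterion
            break
    return model
```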
BOND [
79] combines a self-training approach with distant supervision. During the first stage, distant labels are generated with external knowledge bases and a pre-trained BERT model is adapted to the distantly supervised NER task with early stopping. In the second stage, a teacher-student framework is employed, where the student model
\(\theta ^{stu}\) is trained with pseudo-labels generated by the teacher model
\(\theta ^{tea}\). The teacher is initialized with weights
\(\hat{\theta }\) learned during the first stage, while the student may be initialized with
\(\hat{\theta }\) or pre-trained BERT layers
\(\theta ^{BERT}\). At the
\(t\)-th iteration, the teacher generates weak labels which the student learns to fit. Then, teacher and student are updated
\(\theta ^{tea}_{(t+1)} = \theta ^{stu}_{(t+1)} = \hat{\theta }^{stu}_{(t)}\), where \(\hat{\theta }^{stu}_{(t)}\) is the early-stopped student of iteration \(t\). In this way, pseudo-labels are progressively refined so that the student can exploit their knowledge while avoiding overfitting.
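The second-stage teacher-student loop can be sketched as follows; train_fn and the predict method are hypothetical abstractions for early-stopped training and inference.

```python
import copy

def bond_self_training(student, train_fn, unlabeled, iterations=5):
    """Teacher-student sketch: the teacher pseudo-labels the data, the
    student fits the pseudo-labels, and both are reset to the early-stopped
    student before the next iteration."""
    teacher = copy.deepcopy(student)            # initialized from stage-one weights
    for _ in range(iterations):
        pseudo = [(x, teacher.predict(x)) for x in unlabeled]
        student = train_fn(student, pseudo)     # early-stopped training on pseudo-labels
        teacher = copy.deepcopy(student)        # theta_tea <- theta_stu
    return student
```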
Huang et al. [
57] leverage the labeled training set
\(\mathcal {D}^L\) and all the available in-domain unlabeled samples
\(\mathcal {D}^U\) by resorting to the knowledge-distillation approach proposed by Xie et al. [
164] in image classification. The algorithm operates in three steps: (1) a teacher model
\(\theta ^{tea}\) is learned with labeled tokens
\(\mathcal {D}^L\); (2) the teacher model is used to generate soft labels
\(\mathbf {\hat{y}_i}\) on unlabeled tokens
\(\mathbf {x}_i \in \mathcal {D}^U\),
\(\mathbf {\hat{y}_i} = f_{\theta ^{tea}}(\mathbf {x}_i)\); (3) a student model
\(\theta ^{stu}\) is trained on labeled and unlabeled tokens with a composed Kullback-Leibler loss
\(\mathcal {L}\):
\[ \mathcal {L} = \sum _{(\mathbf {x}_i, \mathbf {y}_i) \in \mathcal {D}^L} \mathrm{KL}\big (\mathbf {y}_i \,\Vert \, f_{\theta ^{stu}}(\mathbf {x}_i)\big ) + \lambda ^U \sum _{\mathbf {x}_i \in \mathcal {D}^U} \mathrm{KL}\big (\mathbf {\hat{y}}_i \,\Vert \, f_{\theta ^{stu}}(\mathbf {x}_i)\big ), \]
where
\(\lambda ^U\) is a weighting hyper-parameter. Experimental results show that self-training usually brings significant performance improvements, even when combined with noisy supervision.
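A PyTorch sketch of this composed loss is given below, assuming per-token logits and (soft) label distributions; the reduction and weighting details are simplifications.

```python
import torch.nn.functional as F

def composed_kl_loss(logits_labeled, gold_probs, logits_unlabeled, soft_labels, lam_u=1.0):
    """Composed KL loss: a supervised term on labeled tokens plus a
    lam_u-weighted distillation term on teacher soft labels."""
    loss_l = F.kl_div(F.log_softmax(logits_labeled, dim=-1), gold_probs,
                      reduction="batchmean")
    loss_u = F.kl_div(F.log_softmax(logits_unlabeled, dim=-1), soft_labels,
                      reduction="batchmean")
    return loss_l + lam_u * loss_u
```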
4.4.1 Performance Evaluation.
The experimental results shown in Table
5 highlight that the self-learning techniques discussed are the most commonly used approaches in zero-shot contexts among those reviewed in this work. However, due to the heterogeneity of the datasets, comparing techniques within this category against each other is not always feasible. Nevertheless, we can examine the results of BOND [
79] on the CoNLL dataset [
125] in a zero-shot context, which can be compared with those achieved by KALM [
80], which attains superior performance thanks to the usage of dictionaries. In addition, the results obtained by BOND [
79] in a zero-shot context are superior to those achieved by Hou et al. [
56] in the Transfer Learning category as shown in Table
1. We also observe that BOND [
79] achieves better performance in a zero-shot context as compared to ProtoNER [
39], despite the latter method utilizing 20 instances for training as shown in Table
2. Similarly, RDANER’s results on BC5CDR can be compared with COSINER in Table
3. Despite COSINER [
7] utilizing fewer training data (i.e., 2% of the original dataset), it presents better performance than RDANER [
174]. However, when dealing with 10% of NCBI-Disease [
32], RDANER [
174] performs better than COSINER [
7].
The results of Huang et al. [
57] for MIT-M [
83] and MIT-R [
82] can only be compared with those of Template-based NER [
22] and PromptSlotTagging [
55] in a 10-shot context. The model of Huang et al. [
57] obtains superior performance on MIT-M [
83] with half the number of training samples and inferior results compared to Template-based NER [
22] on MIT-R [
82]. Furthermore, the model of Huang et al. [
57] displays superior performance on ATIS [
49] than Template-based NER [
22]. On SNIPS [
21], the results outperform Hou et al. [
56] (Table
1) as well as L-TapNet [
54] and Oguz et al. [
105] (Table
2). On i2b2-2010 data [
149], it obtains superior results compared to Dai et al. [
24], confirming that data augmentation alone has a smaller effect on model performance than other few-shot learning approaches.
The results show that the size and diversity of the training data, as well as the specific techniques utilized, have significant effects on the model’s performance in zero-shot or few-shot contexts. These findings suggest a need for further research to investigate how to optimize the combination of training techniques, data variety, and model architecture to achieve improved results in these circumstances.