4.1 Data Augmentation
One common way to deal with the lack of data is
data augmentation, which consists of increasing the size of the available dataset with new samples generated by means of heuristics or external data sources. Augmentation methods explored in the current literature for NLP tasks usually manipulate words in the original sentence via word replacement [
13], random deletion [
160], word position swap [
93], and generative models [
171]. Applying these transformations directly to NER input samples is not straightforward, since the token-level classification implied by this task means that each manipulation impacts the labels. Thus, data augmentation techniques for NER are comparatively less studied [
25]. In the following, we describe current methods applying data augmentation in few-shot scenarios. We provide an example of an augmented sample for each method in Figure
11.
Data Boost [
85] explores the text generation ability of Language Models to generate augmented samples. In particular, GPT-2 [
113] is used as a conditional generator and guided toward specific class labels by means of a
Reinforcement Learning (
RL) approach. An RL stage is added between the softmax and argmax functions of the conditional generator. The
state at step
\(t\) is the generated sentence before
\(t\), i.e., \(s_t = \mathbf {x}_{\lt t}\), where
\(\mathbf {x}_{\lt t} = \lbrace x_0, x_1, \dots , x_{t-1}\rbrace\); the
policy \(\pi _\theta\) is the probability that token
\(x_t\) is chosen (
action \(a_t\)), i.e., the softmax output of the hidden states
\(\pi _\theta (a_t|s_t) = softmax(h^\theta _{\lt t})\).
The reward, following
Proximal Policy Optimization (
PPO) [
130], for a given conditional token \(x_t^c\), is computed as
\[ R(x_t^c) = G(x_t^c) - \beta \, \mathrm{KL}(\theta _c \Vert \theta), \]
where
\(G(x_t^c)\) is the
salience gain, which quantifies the resemblance of the generated token to the target label lexicon via a logarithmic summation of its cosine similarities with each word in the salient lexicon; KL is the
Kullback-Leibler (
KL) divergence between conditional
\(\theta _c\) and unconditional
\(\theta\) distributions;
\(\beta\) is a weighting parameter. However, this method was conceived for text data augmentation and does not directly apply to NER, as it does not provide a mapping of the entity mentions from the original to the augmented sentence.
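To make the reward concrete, the following sketch computes it from precomputed word vectors and token distributions; the function names and the sum-of-logs reading of the "logarithmic summation" are our assumptions, not the original implementation.

```python
import numpy as np

def salience_gain(token_vec, lexicon_vecs):
    """Sum of log cosine similarities between the generated token and each
    word in the target label's salient lexicon (one reading of the paper's
    'logarithmic summation')."""
    sims = [
        np.dot(token_vec, w) / (np.linalg.norm(token_vec) * np.linalg.norm(w))
        for w in lexicon_vecs
    ]
    # clip to keep the logarithm defined for non-positive similarities
    return float(np.sum(np.log(np.clip(sims, 1e-8, None))))

def kl_divergence(p_cond, p_uncond):
    """KL divergence between conditional and unconditional token distributions."""
    p_cond = np.clip(p_cond, 1e-12, 1.0)
    p_uncond = np.clip(p_uncond, 1e-12, 1.0)
    return float(np.sum(p_cond * np.log(p_cond / p_uncond)))

def reward(token_vec, lexicon_vecs, p_cond, p_uncond, beta=0.1):
    # PPO-style reward: salience gain minus a KL penalty that keeps the
    # conditional generator close to the unconditional language model
    return salience_gain(token_vec, lexicon_vecs) - beta * kl_divergence(p_cond, p_uncond)
```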
Counterfactual Generator [
176] addresses the poor generalization ability of few-shot systems in the presence of spurious correlations between entities and their contexts. For example, in the sentence “John lives in New York”, the entity “New York” and its context (“John lives in”) are highly correlated, but it is not true that one
causes the other. Thus, based on the idea that entities and contexts are not in a causal relationship, Zeng et al. [
176] provide a framework to generate weakly-labeled counterfactual examples in a few steps: (1) a vocabulary with the desired entity type is prepared, (2) an entity is randomly selected from the input sentence and (3) replaced with a different entity from the entity set to form a counterfactual example; finally, (4) a discriminator model is trained to distinguish good examples from counterfactuals: if the discriminator, i.e., a NER model, can correctly recognize the replaced entity, the counterfactual example is included in the new training set.
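A minimal sketch of the replacement steps (2)–(3) is given below, assuming BIO-tagged inputs and a per-type entity vocabulary; the discriminator interface is left abstract.

```python
import random

def make_counterfactual(tokens, labels, entity_vocab):
    """Replace one randomly chosen entity mention with another entity
    of the same type to form a weakly-labeled counterfactual example."""
    # collect (start, end, type) spans of entity mentions from BIO labels
    spans = []
    for i, lab in enumerate(labels):
        if lab.startswith("B-"):
            j = i + 1
            while j < len(labels) and labels[j] == "I-" + lab[2:]:
                j += 1
            spans.append((i, j, lab[2:]))
    if not spans:
        return None
    start, end, etype = random.choice(spans)
    replacement = random.choice(entity_vocab[etype]).split()
    new_tokens = tokens[:start] + replacement + tokens[end:]
    new_labels = (labels[:start]
                  + ["B-" + etype] + ["I-" + etype] * (len(replacement) - 1)
                  + labels[end:])
    return new_tokens, new_labels

# step (4): a counterfactual is kept only if a discriminator NER model
# still recognizes the replaced span (interface assumed), e.g.:
# if discriminator_recognizes(new_tokens, new_labels): train_set.append(...)
```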
Dai et al. [
25] investigate the adaptation of several simple data augmentation methods to NER problems. They are listed as follows (an illustrative sketch is given after the list):
—
Label-wise Token Replacement (LwTR): each token in the input sentence is randomly selected for replacement; if selected, it is replaced with another available token carrying the same label.
—
Synonym Replacement (SR): similar to LwTR, but tokens are replaced with synonyms retrieved from WordNet.
—
Mention Replacement (MR): similar to LwTR, but applied only to mentions, which are randomly replaced with other mentions of the same type from the original training set.
—
Shuffle within Segments (SiS): sentences are split into segments with the same label, and tokens are randomly shuffled within these segments.
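The sketch below illustrates two of these operations on (token, label) sequences; SR and MR follow the same pattern, drawing replacements from WordNet synonyms and same-type mentions, respectively. The names and the replacement probability are illustrative.

```python
import random

def label_wise_token_replacement(tokens, labels, label_to_tokens, p=0.3):
    """LwTR: each token is replaced, with probability p, by another
    training-set token carrying the same label."""
    out = []
    for tok, lab in zip(tokens, labels):
        if random.random() < p and label_to_tokens.get(lab):
            tok = random.choice(label_to_tokens[lab])
        out.append(tok)
    return out, list(labels)

def shuffle_within_segments(tokens, labels):
    """SiS: split the sentence into maximal same-label segments and
    shuffle the tokens inside each segment."""
    out, i = [], 0
    while i < len(tokens):
        j = i
        while j < len(tokens) and labels[j] == labels[i]:
            j += 1
        segment = tokens[i:j]
        random.shuffle(segment)
        out.extend(segment)
        i = j
    return out, list(labels)
```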
Experimental results show that all the investigated data augmentation techniques are effective in few-shot settings (e.g., 50- and 100-shot scenarios), but they often fail on full datasets due to the degrading effects of noise.
COSINER [
7] is a more sophisticated variation of
Mention Replacement (
MR) [
25]. It employs cosine similarity to replace mentions with the most similar ones, rather than replacing them at random. Similarity is computed between concept embeddings
\(V_{concept}\) extracted from each mention in the training set, which combine mention and context information, with
\(V_{context}\) being the embedding extracted from a transformer network used as a feature extractor. The similarity-based nature of this method yields performance improvements over techniques based on random replacements.
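Assuming the concept embeddings are already available, the core replacement step can be sketched as follows (illustrative names; the exact formula for \(V_{concept}\) is given in the original paper):

```python
import numpy as np

def most_similar_mention(mention_vec, candidate_vecs, candidate_mentions):
    """Replace a mention with the most similar same-type candidate,
    using cosine similarity between concept embeddings."""
    norms = np.linalg.norm(candidate_vecs, axis=1) * np.linalg.norm(mention_vec)
    sims = candidate_vecs @ mention_vec / np.clip(norms, 1e-12, None)
    return candidate_mentions[int(np.argmax(sims))]
```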
StyleNER [
18] learns patterns (e.g., style, noise, abbreviations) that differentiate data from high-resource and low-resource domains, and a shared space where they are aligned. The key idea is that text data may differ in textual patterns (e.g., long and complex sentences from research abstracts versus social media data), but their semantics are transferable. Given a source domain
\(D_{src}\) and a target domain
\(D_{tgt}\) dataset, sentences are linearized by inserting entity labels before their corresponding text span, and a random pair of sentences is then extracted from the two datasets and provided as input to the model. The model works in two steps:
—
Denoising reconstruction: the model learns input embeddings from their corresponding domain. Inputs are perturbed in several ways (e.g., shuffling, dropping, masking) to inject noise, and the model has to capture semantics and learn patterns that differentiate sentences across domains. Then, a decoder reconstructs noisy sentences to their corresponding domain.
—
Detransforming reconstruction: based on their semantics, sentences are transformed from one domain to the other. Then, the model generates embeddings for transformed sentences and learns to reconstruct each of them in their corresponding domain.
A discriminator is also trained to distinguish the domain of an embedded sentence, which allows determining whether the encoder can generate meaningful representations or whether the model has to bypass the intermediate mapping step between domains.
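A minimal sketch of the noise-injection step used for denoising reconstruction is shown below; the perturbation probabilities and the local-shuffling scheme are our assumptions.

```python
import random

def perturb(tokens, p_drop=0.1, p_mask=0.1, shuffle_window=3):
    """Noise injection for denoising reconstruction: drop, mask, and
    locally shuffle tokens so the encoder must rely on semantics."""
    noisy = [t for t in tokens if random.random() >= p_drop]              # dropping
    noisy = ["[MASK]" if random.random() < p_mask else t for t in noisy]  # masking
    # local shuffling: each token moves at most shuffle_window - 1 positions
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(noisy))]
    return [t for _, t in sorted(zip(keys, noisy))]
```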
4.1.1 Performance Evaluation.
The experimental results summarized in Table
3 offer empirical proof to back up the claim that using data augmentation methods consistently results in performance gains. Moreover, the data suggest that limited contexts, such as those encountered in few-shot scenarios, can benefit even more from data augmentation. It is worth noting, however, that comparing the effectiveness of various data augmentation techniques across different datasets is difficult. Nevertheless, the results obtained on OntoNotes
by ProtoNER [
39] presented in Table
2 suggest that meta-learning outperforms data augmentation techniques, achieving better performance with only 20 training samples. This finding suggests that while data augmentation alone may be less effective than model-centric approaches, combining data augmentation with methods specially designed for few-shot learning may lead to even better performance.
4.2 Distant Supervision
In few-shot scenarios, labeled data can be retrieved from heuristics, different domains or languages, external knowledge bases, or ontologies.
Distant supervision [
94] aims at leveraging such resources to heuristically annotate training data. For example, in biomedicine many curated resources are available:
NCBO Bioportal [
161] houses 541 biomedical ontologies,
Medical Subject Headings (MeSH) is a controlled vocabulary with 347,692 classes of medical items, among others. Combining ontologies is a difficult task due to their heterogeneous structures, differing concept granularities, and overlaps or conflicts between entity definitions. Generally, the main steps of distant supervision are (1)
candidate generation, i.e., the identification of potential entities, and (2)
labeling heuristics to generate noisy labels, as shown in the example of Figure
12. The use of distant supervision for FS-NER in low-resource languages has yet to be deeply explored. The amount of external information available in low-resource settings might be very limited: for example, the Wikipedia knowledge graph contains 4 million person names in English while only 32 thousand in Yorùbá [
1]. Furthermore, without further tuning under better supervision, distantly supervised models have low recall [
14]. In the following, we describe methods that apply distant supervision to NER tasks.
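Before turning to specific methods, a minimal sketch of the generic two-step pipeline (candidate generation plus labeling heuristics) is given below; the longest-match strategy is an illustrative choice, and real systems typically combine several such heuristics.

```python
def distant_labels(tokens, lexicon):
    """Heuristic annotation: longest-match dictionary lookup producing
    noisy BIO labels (step 1: candidate generation; step 2: labeling)."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for j in range(len(tokens), i, -1):           # longest match first
            etype = lexicon.get(" ".join(tokens[i:j]).lower())
            if etype:
                labels[i] = "B-" + etype
                for k in range(i + 1, j):
                    labels[k] = "I-" + etype
                i, matched = j, True
                break
        if not matched:
            i += 1
    return labels
```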
SwellShark [
38] aims at automating the generation of candidates and noisy labels without hand-labeled data. Its inputs are (1) a collection of unlabeled documents and (2) some form of weak supervision, typically ontologies and heuristic rules. The candidate generator
\(\Gamma _\Theta\) is defined as the mapping function from a document collection
\(D\) into a candidate set
\(\Gamma _\Theta : D \rightarrow \lbrace x_1, x_2, \dots , x_N\rbrace\), where each candidate
\(x_i\) is a character-level span within the document. Candidates in SwellShark are determined by heuristics, e.g., matching a dictionary of noun phrases with regular expressions. To filter actual entities from candidates, a
labeling function generator \(\Gamma _\lambda\) is defined as a function which receives a resource
\(R\) for weak supervision (e.g., ontology, term frequencies) and generates labeling functions
\(\Gamma _\lambda : R \rightarrow \lbrace \lambda _1, \lambda _2, \dots , \lambda _N\rbrace\) (e.g., lexicon matching, frequency-based thresholds).
Choosing heuristics intrinsically involves a tradeoff between development time and performance: greedy heuristics such as dictionary matching often imply low recall; noun-phrase candidates generate large sets, which usually increases noise; and hand-tuned heuristics may indeed result in high performance, but require more effort.
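The sketch below illustrates the two components \(\Gamma _\Theta\) and \(\Gamma _\lambda\); the regular expression and the voting convention are illustrative assumptions.

```python
import re

def candidate_generator(documents, pattern=r"[A-Za-z][A-Za-z-]+(?: [A-Za-z-]+){0,3}"):
    """Gamma_Theta: map a document collection to character-level candidate spans."""
    for doc in documents:
        for m in re.finditer(pattern, doc):
            yield doc, m.start(), m.end()

def make_labeling_functions(ontology_terms, low_freq_terms):
    """Gamma_lambda: turn weak-supervision resources into labeling functions
    that vote positive (1), negative (-1), or abstain (0) on each candidate."""
    def lf_lexicon(doc, start, end):
        return 1 if doc[start:end].lower() in ontology_terms else 0
    def lf_frequency(doc, start, end):
        return -1 if doc[start:end].lower() in low_freq_terms else 0
    return [lf_lexicon, lf_frequency]
```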
AutoNER [
134] handles the problems of
incompleteness and
noise in automatically-labeled NER data, which characterize methods based on the detection of entity spans with heuristic rules, such as regular expressions [
38,
120] and exact string matching [
43,
51]. In these methods, unmatched tokens are simply ignored when entities are not covered by the ontology in use, introducing many false-negative labels (incompleteness). Furthermore, they often require expensive expert effort to cover many special cases. To handle these problems, AutoNER marks some high-quality out-of-dictionary phrases as “potential entities” without requiring human effort. In particular, to leverage the information embedded in dictionaries, AutoNER proposes the
tie or break tagging scheme, which tags two adjacent tokens as (1)
tie if they belong to the same entity type, (2)
unknown if at least one of them belongs to an unknown-type phrase, and (3)
break otherwise. This scheme is used for entity span detection, while entity types are then identified based on feature vectors of the candidates. By predicting whether two adjacent tokens refer to the same entity, AutoNER builds more robust distantly supervised models, since distant supervision is often noisy on the boundaries of entity mentions but not on their inner ties.
One of the main advantages of AutoNER is that entity mentions may be marked as
unknown, allowing the inclusion of tokens whose types cannot be identified through distant supervision. For example, “prostaglandin synthesis” may be present in both disease and chemical lexicons, or the lexicon may not cover all the possible entity types. AutoPhrase [
133] is used to automatically mine new phrases from unlabeled text and a dictionary of high-quality phrases. All the out-of-dictionary phrases are then labeled as
unknown and added to the dictionary.
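A minimal sketch of the tie-or-break tagging is shown below, assuming the dictionary and AutoPhrase matches have already been resolved to pairs of adjacent token indices.

```python
def tie_or_break(tokens, dictionary_pairs, unknown_pairs):
    """Tag each adjacent token pair as 'tie', 'unknown', or 'break'.
    dictionary_pairs / unknown_pairs: sets of index pairs (i, i + 1) falling
    inside known-type and unknown-type phrases, respectively."""
    tags = []
    for i in range(len(tokens) - 1):
        if (i, i + 1) in unknown_pairs:
            tags.append("unknown")     # at least one token in an unknown-type phrase
        elif (i, i + 1) in dictionary_pairs:
            tags.append("tie")         # both tokens inside the same entity
        else:
            tags.append("break")
    return tags
```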
Yang et al. [
167] also address the
incomplete and
noisy annotation problems. Weak sentences are obtained by using a dictionary
\(D\) built from named entities available in the training dataset
\(H\) to weakly annotate a large unlabeled pool of sentences. Note that in this work “distant” resources are used to augment the training set, but the dictionary is built by relying on the available data only, which may be a limitation in highly-constrained few-shot settings, where training data do not cover all the possible named entities. To handle the
incomplete annotation problem, sentences are allowed to have partial annotations: some token spans are annotated with definite labels, while all the others are associated with all the possible labels (a sketch follows this paragraph). The
noisy annotations problem is handled with a reinforcement learning approach which follows Feng et al. [
34] to obtain clean instances from distantly supervised NER data. In particular, the state
\(\mathbf {S_t}\) is a vector containing: (1) representation of the current sentence obtained with a BiLSTM layer, (2) label scores computed with a MLP layer from the shared encoder, and (3) distant annotation of the instance. The action
\(a_t \in \lbrace 0,1\rbrace\) indicates whether to select the
\(t\)th distantly supervised sentence, and the policy is learned by optimizing NER performance (reward).
Knowledge-Augmented Language Model (KALM) [
80] augments a traditional language model with a knowledge base without requiring any additional component; in addition, it learns to recognize entities in an entirely unsupervised way by using entity type information which is latent in the model. In particular, KALM has the ability to predict masked words from a vocabulary
\(V_g\) like any other language model, but it has a separate vocabulary
\(V_i\) for each entity type and is able to predict whether to expect an entity from the context. Formally, given a latent variable
\(\tau _i\) denoting the entity type
\(i\) and previously observed words
\(c_t = [y_1, y_2, \dots , y_{t-1}, y_t]\), the probability of the next word is computed as follows:
\[ P(y_{t+1} \mid c_t) = \sum _j P(y_{t+1} \mid \tau _{t+1} = j, c_t) \, P(\tau _{t+1} = j \mid c_t), \]
where
\(P(y_{t+1} | \tau _{t+1}, c_t)\) is the distribution of entity words of type
\(\tau _{t+1}\) and
\(P(\tau _{t+1}=j | c_t)\) is the probability that the next word has a given type
\(j\). Both are computed as in standard language models, i.e., by projecting the hidden state of an LSTM model and normalizing with a softmax. However, the base model is enhanced by using as input not only the embedding vector of the input word, but also the embedding of the type of the previous word: in this way, KALM models context while taking account of entity types, which allows latent types to be learned more accurately during training.
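The marginalization over latent types can be sketched as follows; for simplicity, we assume all type vocabularies are projected onto a common index space.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_word_probability(h_t, W_types, W_words_per_type):
    """Marginalize over latent entity types:
    P(y | c_t) = sum_j P(tau = j | c_t) * P(y | tau = j, c_t).
    h_t: LSTM hidden state; W_types: type projection;
    W_words_per_type[j]: projection onto type j's vocabulary."""
    p_type = softmax(W_types @ h_t)
    vocab_size = W_words_per_type[0].shape[0]
    p_word = np.zeros(vocab_size)
    # each type contributes its own word distribution, weighted by p_type
    for j, W_j in enumerate(W_words_per_type):
        p_word += p_type[j] * softmax(W_j @ h_t)
    return p_word
```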
Cao et al. [
14] try to maximize the potential of automatically weakly labeled data (e.g., anchors from Wikipedia) by dealing with the
incomplete and
noisy annotations problems. To obtain a high recall, the framework generates as many weakly-labeled data as possible with a
label induction approach assigning labels to words based on Wikipedia anchors and taxonomy, and a
data selection scheme that computes a scoring function to distinguish high-quality data among the weak sentences; a neural model is then trained on such data. For data selection, the approach of Ni et al. [
102], which is based on annotation confidence and coverage, is applied. For sequence labeling from high-quality weakly-labeled data, Partial CRFs [
146] are employed, while a
classification module regards name tagging from noisy weakly-labeled data as a multi-label classification problem, predicting each word label separately. The first layers of the two networks are shared so as to allow knowledge distillation.
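One possible form of the confidence-and-coverage scoring used for data selection is sketched below; the exact scoring function is an assumption on our part.

```python
def sentence_quality(weak_labels, confidences):
    """Data selection sketch: score weak sentences by annotation confidence
    and coverage (the multiplicative combination is a hypothetical choice)."""
    labeled = [c for lab, c in zip(weak_labels, confidences) if lab != "O"]
    if not labeled:
        return 0.0
    confidence = sum(labeled) / len(labeled)       # avg confidence on entity tokens
    coverage = len(labeled) / len(weak_labels)     # fraction of labeled tokens
    return confidence * coverage
```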
Graph Attention Model [
87] uses a domain-specific dictionary to create a
word formation graph, which captures variants of entities and thus discovers as many new entity mentions as possible. In particular, the vertex set contains all words in the dictionary, which are connected by undirected edges if they appear in the same entity type. Mention candidates are then extracted with a graph-matching algorithm. After extracting all the candidates, a
word-mention graph integrates the word formation of candidate entities into their sentences: the vertex set contains the words in the input sentence and the extracted mention candidates. In this way, word-word links capture contextual information, while word-mention links capture the semantics of mentions. Graph information is then leveraged by a learning model including (1) a word embedding layer to represent words, (2) a BiLSTM layer to capture contextual information, and (3) a
graph attention network (
GAT) [
151] to incorporate the information of mention candidates.
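One plausible reading of the word formation graph construction is sketched below; the exact edge criterion is an assumption.

```python
from collections import defaultdict
from itertools import combinations

def word_formation_graph(dictionary):
    """Vertices are all dictionary words; two words are linked by an undirected
    edge if they appear in entities of the same type (assumed edge criterion)."""
    words_by_type = defaultdict(set)
    for entity, etype in dictionary:               # e.g., ("heart failure", "Disease")
        words_by_type[etype].update(entity.split())
    adj = defaultdict(set)
    for words in words_by_type.values():
        for u, v in combinations(sorted(words), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj
```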
Linked HMM [
123] claims that the standard approach of generating candidate spans and then independently labeling each candidate, as in SwellShark [
38], limits the applicability to tasks for which candidate generators exist and increases human effort. Furthermore, candidate generators should identify all the possible entity spans, since errors propagate through the pipeline. Hence, Linked HMM allows users to write multiple rules that provide partial tags to sequences, whose accuracy is estimated by using an identifiable probabilistic generative model
without labeled training data. The estimated posterior distribution over the true tags is then used to train a sequence tagger. The rules that can be provided fall into two categories: (1)
tagging rules, which vote on the correct tags of sequence elements (they take a training input sequence and output a sequence of the same length indicating their votes on the true tags); (2)
linking rules, which vote on whether an adjacent element should have the same or a different tag (they allow distant supervision to propagate along sequences in a user-controlled way). These two types of rules are then used by a
linked Hidden Markov Model to estimate the true tags for training by reconciling incomplete and conflicting information from multiple rules provided by users.
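The two rule types can be sketched as simple functions over token sequences; the concrete rules below (lexicon votes and hyphen linking) are illustrative examples only.

```python
def tagging_rule_lexicon(tokens, lexicon):
    """Tagging rule: vote on the tag of each element (None = abstain).
    Outputs a sequence of the same length as the input."""
    return [lexicon.get(tok.lower()) for tok in tokens]

def linking_rule_hyphen(tokens):
    """Linking rule: vote on whether adjacent elements share a tag
    (True = same tag). Here, a '-' token links its neighbours."""
    return [tokens[i] == "-" or tokens[i + 1] == "-" for i in range(len(tokens) - 1)]
```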
Consensus Network [
67] can be trained on imperfect annotations from multiple sources (e.g., crowd annotations, cross-domain data) by learning representations for each source and dynamically aggregating them by a context-aware attention mechanism. This is based on the intuition that different sources of supervision may have different strengths based on scenarios where they are applied. The framework first uses a multi-task learning schema based on a BiLSTM-CRF network to
decouple model parameters into a set of annotator-invariant model parameters and a set of annotator-specific representations, then trains a context-aware attention module for a consensus-based representation by combining predictions on the target data. Specifically, scores from the annotator
\(k\) are obtained by combining the emission and transition score matrices from the BiLSTM-CRF network with an annotator-dependent matrix
\(\mathbf {A}^{(k)}\) which represents the pattern of annotation bias, i.e., the entry
\(\mathbf {A}^{(k)}_{ij}\) is the probability of assigning the wrong label
\(j\) instead of the correct label
\(i\). Scores from different annotators are then combined with weighted voting, where the weights are given by F1 scores on the training set; an attention module is added to improve generalization, since it gives more weight to the sources that are more related to the input sentence.
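The consensus aggregation can be sketched as a weighted vote over per-annotator score matrices, as below; the interaction with the attention module is simplified.

```python
import numpy as np

def consensus_scores(annotator_scores, f1_weights, attention_weights=None):
    """Combine per-annotator label score matrices (num_tokens x num_labels)
    by weighted voting; weights default to training-set F1 scores, optionally
    refined by context-aware attention weights."""
    w = np.asarray(f1_weights, dtype=float)
    if attention_weights is not None:
        w = w * np.asarray(attention_weights)
    w = w / w.sum()
    return sum(wk * s for wk, s in zip(w, annotator_scores))
```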
4.2.1 Performance Evaluation.
The experimental findings summarized in Table
4 demonstrate that distant supervision techniques can yield results that are comparable in quality to gold labeling at a significantly lower cost. The development of annotation functions, heuristics, dictionaries, or rules can serve as a cost-effective alternative to time-consuming and expensive labeling processes, even though some human effort may still be required. Moreover, combining weak labels with dictionaries has been found to be more effective than relying on partial training data or techniques such as data augmentation, as can be seen by comparing results obtained on biomedical datasets (i.e., NCBI-Disease [
32] and BC5CDR [
77]) with those presented in Tables
3,
5. Furthermore, the results obtained using the KALM method [
80] for CoNLL [
125] demonstrate its superiority over transfer-learning methods (refer to Table
1) that rely on a limited number of labeled samples. In fact, KALM [
80] yields outcomes that are almost as accurate as those generated using gold labels as shown in Table
5, hence highlighting the efficiency of weak labeling methods in minimizing labeling costs while retaining near-optimal quality.
4.4 Self-training
Self-training, also referred to as
self-learning, is an approach similar to distant supervision, where the model is trained on examples labeled by the model itself. As shown in Figure
14, the difference is that the labeling heuristic is replaced by the model itself. Originally proposed by Scudder [
131], it is one of the earliest semi-supervised methods. In the NLP field, it has been successfully applied to neural machine translation [
50] and sentence classification [
99]. Zoph et al. [
183] show that self-training yields performance improvements in both high- and low-data scenarios, whereas pre-training can even hurt performance when stronger data augmentation is used. In the following, we explore self-training approaches for FS-NER.
LM-LSTM-CRF [
84] leverages the knowledge obtained by a language model trained in an unsupervised fashion to improve sequence-labeling performance. It exploits word-level and character-level information of input samples in a co-training fashion, i.e., each input is represented by different sets of features, each providing complementary information. The language model and the sequence-labeling model share the same character-level layer, which consists of two LSTM units learning character-level information in a completely unsupervised way to capture the style and structure of input texts. However, since the two tasks are not strongly related, empirical results show that naive sharing can hurt the overall performance. Hence, the outputs of the character-level layer are transformed into task-specific spaces, so that the language model can indirectly provide its knowledge to the NER model through the shared layer, without sharing its feature space.
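The shared character-level layer with task-specific projections can be sketched in PyTorch as follows (dimensions and names are illustrative):

```python
import torch.nn as nn

class SharedCharLayer(nn.Module):
    """Character-level BiLSTM shared between the language model and the
    sequence labeler, with task-specific projections so the two tasks
    do not share a feature space directly."""
    def __init__(self, char_vocab, char_dim=30, hidden=50):
        super().__init__()
        self.embed = nn.Embedding(char_vocab, char_dim)
        self.lstm = nn.LSTM(char_dim, hidden, bidirectional=True, batch_first=True)
        self.to_lm = nn.Linear(2 * hidden, 2 * hidden)    # language-model space
        self.to_ner = nn.Linear(2 * hidden, 2 * hidden)   # sequence-labeling space

    def forward(self, char_ids):
        h, _ = self.lstm(self.embed(char_ids))
        return self.to_lm(h), self.to_ner(h)
```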
Chen et al. [
15], considering that the performance of self-training highly depends on how new data are selected, use reinforcement learning to learn how to select instances from an unlabeled pool to be added to the training set. The NER model, previously trained on a few samples, classifies the newly selected samples and is retrained. The self-training approach is treated as a decision process, described as a function that receives the self-labeled instances as inputs and outputs the acceptance or rejection of such instances into the training set. A
Deep Q-Network (
DQN) [
96] then learns the selection strategy based on performance improvements on a development set. The Q-function is implemented using a three-layer neural network that receives the state
\(s = (h_s, h_c, h_p, h_t)\) of the learning framework as input, where
\(h_s\) is a representation of the input sample,
\(h_c\) is the confidence of tagging the instance using the model,
\(h_p\) is the marginals of the prediction,
\(h_t\) is the hidden representation from the model. The Q-function
\(Q^\pi (s,a)\rightarrow R\) receives the state
\(s\) and an action
\(a\) as inputs and returns a reward as a result of the execution of
\(a\). The policy
\(\pi\) aims at maximizing the reward of actions. In this work, the reward is defined as the difference in NER performance when adding a new sample (i.e., the action) to the training dataset. This setting is similar to that of Fang et al. [
33], who apply DQNs to find the best Active Learning policy.
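A minimal PyTorch sketch of the three-layer Q-function is given below; the hidden size is an assumption.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Three-layer Q-function: maps the state s = (h_s, h_c, h_p, h_t),
    concatenated into one vector, to Q-values for the two actions
    (reject / accept the self-labeled instance)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),    # Q(s, a) for a in {0, 1}
        )

    def forward(self, state):
        return self.net(state)

# the reward for accepting an instance is the change in NER performance
# on a development set after retraining with that instance
```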
RDANER [
174] uses a bootstrapping approach to obtain model predictions on easily-obtainable unlabeled data and retrain the model with the augmented weak dataset. Firstly, a general domain pre-trained language model
\(\mathcal {M}\) is fine-tuned on the few-shot labeled corpus available
\(\mathcal {D}^L\). Then,
\(\mathcal {M}\) is used to annotate an unsupervised dataset
\(\mathcal {D}^U\) to obtain a weakly annotated dataset
\(\mathcal {D}_{weak}^U\) which is then combined with
\(\mathcal {D}^L\) to get the augmented corpus. A threshold
\(\theta\) is used to filter out tags with low probabilities assigned by
\(\mathcal {M}\). This process is iterated until an acceptable level of accuracy is achieved or the maximum number of iterations is reached. Experiments show that this simple approach obtains worse F1-scores than domain-specific pre-trained language models (e.g., BioBERT [
69], SciBERT [
8]), but it could be a good alternative in low-resource languages and domains where huge domain-specific language model variants are not available.
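The bootstrapping loop can be sketched as follows; the model interface (fit, predict_with_confidence, evaluate) is a hypothetical abstraction, not RDANER's actual API.

```python
def rdaner_loop(model, labeled, unlabeled, theta=0.9, max_iter=5, target_f1=0.85):
    """Bootstrapping sketch: fine-tune, weakly annotate, filter tags by the
    confidence threshold theta, and retrain on the augmented corpus."""
    train_set = list(labeled)
    for _ in range(max_iter):
        model.fit(train_set)                         # fine-tune on current corpus
        weak = []
        for sentence in unlabeled:
            tags, probs = model.predict_with_confidence(sentence)
            kept = [(t if p >= theta else "O") for t, p in zip(tags, probs)]
            weak.append((sentence, kept))
        train_set = list(labeled) + weak
        if model.evaluate(labeled) >= target_f1:     # proxy stopping criterion
            break
    return model
```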
BOND [
79] combines a self-training approach with distant supervision. During the first stage, distant labels are generated with external knowledge bases and a pre-trained BERT model is adapted to the distantly supervised NER task with early stopping. In the second stage, a teacher-student framework is employed, where the student model
\(\theta ^{stu}\) is trained with pseudo-labels generated by the teacher model
\(\theta ^{tea}\). The teacher is initialized with weights
\(\hat{\theta }\) learned during the first stage, while the student may be initialized with
\(\hat{\theta }\) or pre-trained BERT layers
\(\theta ^{BERT}\). At the
\(t\)-th iteration, the teacher generates weak labels which the student learns to fit. Then, teacher and student are updated
\(\theta ^{tea}_{(t+1)} = \theta ^{stu}_{(t+1)} = \hat{\theta }^{stu}_{(t)}\), where \(\hat{\theta }^{stu}_{(t)}\) is the early-stopped student of iteration \(t\). In this way, pseudo-labels are progressively refined so that the student can exploit their knowledge while avoiding overfitting.
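The second-stage teacher-student loop can be sketched as follows; train_fn and the predict method are hypothetical abstractions for early-stopped training and inference.

```python
import copy

def bond_self_training(student, train_fn, unlabeled, iterations=5):
    """Teacher-student sketch: the teacher pseudo-labels the data, the
    student fits the pseudo-labels, and both are reset to the early-stopped
    student before the next iteration."""
    teacher = copy.deepcopy(student)            # initialized from stage-one weights
    for _ in range(iterations):
        pseudo = [(x, teacher.predict(x)) for x in unlabeled]
        student = train_fn(student, pseudo)     # early-stopped training on pseudo-labels
        teacher = copy.deepcopy(student)        # theta_tea <- theta_stu
    return student
```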
Huang et al. [
57] leverage the labeled training set
\(\mathcal {D}^L\) and all the available in-domain unlabeled samples
\(\mathcal {D}^U\) by resorting to the knowledge-distillation approach proposed by Xie et al. [
164] in image classification. The algorithm operates in three steps: (1) a teacher model
\(\theta ^{tea}\) is learned with labeled tokens
\(\mathcal {D}^L\); (2) the teacher model is used to generate soft labels
\(\mathbf {\hat{y}_i}\) on unlabeled tokens
\(\mathbf {x}_i \in \mathcal {D}^U\),
\(\mathbf {\hat{y}_i} = f_{\theta ^{tea}}(\mathbf {x}_i)\); (3) a student model
\(\theta ^{stu}\) is trained on labeled and unlabeled tokens with a composed Kullback-Leibler loss
\(\mathcal {L}\):
\[ \mathcal {L} = \sum _{(\mathbf {x}_i, \mathbf {y}_i) \in \mathcal {D}^L} \mathrm{KL}\big (\mathbf {y}_i \,\Vert \, f_{\theta ^{stu}}(\mathbf {x}_i)\big ) + \lambda ^U \sum _{\mathbf {x}_i \in \mathcal {D}^U} \mathrm{KL}\big (\mathbf {\hat{y}}_i \,\Vert \, f_{\theta ^{stu}}(\mathbf {x}_i)\big ), \]
where
\(\lambda ^U\) is a weighting hyper-parameter. Experimental results show that self-training usually brings significant performance improvements, even when combined with noisy supervision.
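A PyTorch sketch of this composed loss is given below, assuming per-token logits and (soft) label distributions; the reduction and weighting details are simplifications.

```python
import torch.nn.functional as F

def composed_kl_loss(logits_labeled, gold_probs, logits_unlabeled, soft_labels, lam_u=1.0):
    """Composed KL loss: a supervised term on labeled tokens plus a
    lam_u-weighted distillation term on teacher soft labels."""
    loss_l = F.kl_div(F.log_softmax(logits_labeled, dim=-1), gold_probs,
                      reduction="batchmean")
    loss_u = F.kl_div(F.log_softmax(logits_unlabeled, dim=-1), soft_labels,
                      reduction="batchmean")
    return loss_l + lam_u * loss_u
```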
4.4.1 Performance Evaluation.
The experimental results shown in Table
5 highlight that the self-learning techniques discussed are the most commonly used approaches in zero-shot contexts among those reviewed in this work. However, due to the heterogeneity of the datasets, comparing techniques within this category against each other is not always feasible. Nevertheless, we can examine the results of BOND [
79] on the CoNLL dataset [
125] in a zero-shot context, which can be compared with those achieved by KALM [
80], which attains superior performance thanks to the usage of dictionaries. In addition, the results obtained by BOND [
79] in a zero-shot context are superior to those achieved by Hou et al. [
56] in the Transfer Learning category as shown in Table
1. We also observe that BOND [
79] achieves better performance in a zero-shot context as compared to ProtoNER [
39], despite the latter method utilizing 20 instances for training as shown in Table
2. Similarly, RDANER’s results on BC5CDR can be compared with COSINER in Table
3. Despite COSINER [
7] utilizing fewer training data (i.e., 2% of the original dataset), it presents better performance than RDANER [
174]. However, when dealing with 10% of NCBI-Disease [
32], RDANER [
174] performs better than COSINER [
7].
The results of Huang et al. [
57] for MIT-M [
83] and MIT-R [
82] can only be compared with those of Template-based NER [
22] and PromptSlotTagging [
55] in a 10-shot context. The model of Huang et al. [
57] obtains superior performance on MIT-M [
83] with half the number of training samples and inferior results compared to Template-based NER [
22] on MIT-R [
82]. Furthermore, the model of Huang et al. [
57] displays superior performance on ATIS [
49] than Template-based NER [
22]. On SNIPS [
21], the results outperform Hou et al. [
56] (Table
1) as well as L-TapNet [
54] and Oguz et al. [
105] (Table
2). On i2b2-2010 data [
149], it obtains superior results compared to Dai et al. [
24], confirming that data augmentation alone has a smaller effect on model performance than other few-shot learning approaches.
The results show that the size and diversity of the training data, as well as the specific techniques utilized, have significant effects on the model’s performance in zero-shot or few-shot contexts. These findings suggest a need for further research to investigate how to optimize the combination of training techniques, data variety, and model architecture to achieve improved results in these circumstances.