
Leveraging Dual Gloss Encoders in Chinese Biomedical Entity Linking

Published: 08 February 2024

Abstract

Entity linking is the task of assigning a unique identity to named entities mentioned in a text, a form of word sense disambiguation that focuses on automatically determining a pre-defined sense for a target entity to be disambiguated. This study proposes the DGE (Dual Gloss Encoders) model for Chinese entity linking in the biomedical domain. We separately model a dual encoder architecture, comprising a context-aware gloss encoder and a lexical gloss encoder, for contextualized embedding representations. The dual encoders are then jointly optimized to assign the nearest gloss (i.e., the one with the highest score) for target entity disambiguation. The experimental datasets consist of a total of 10,218 sentences that were manually annotated with glosses defined in BabelNet 5.0 across 40 distinct biomedical entities. Experimental results show that the DGE model achieved an F1-score of 97.81, outperforming other existing methods. A series of model analyses indicate that the proposed approach is effective for Chinese biomedical entity linking.

1 Introduction

Word Sense Disambiguation (WSD) is a fundamental and long-standing task in natural language processing (NLP) that automatically assigns a pre-defined sense to a target word in a text. Identifying the correct word sense is important for natural language understanding because different senses reflect different meanings of a target word in a specific context determined by other words in the sentence. Entity linking (EL) assigns a unique identity to named entities mentioned in a text, a sort of WSD, especially when the target word is the concerned named entity.
Definitions of senses are usually called glosses, and have been shown to be valuable for improving WSD performance [Lesk 1986; Banerjee and Pedersen 2003; Basile et al. 2014; Luo et al. 2018a; 2018b]. The gloss information of the existing language resources for WSD is usually retrieved from WordNet [Miller 1995], an English language lexical database, providing an inventory for WSD systems to match a suitable sense of a lemma from lexicographic perspectives [Moro et al. 2014]. However, texts containing a named entity for EL may or may not exactly match the word form found in the lexical database. Compared with lexicon-based resources for EL, encyclopedia-based semantic networks that provide richer and more fine-grained information would be well-suited for domain-specific EL. BabelNet [Navigli and Ponzetto 2012] is a multilingual encyclopedic dictionary with lexicographic coverage of terms and a semantic ontology that contains concepts and named entities in a large network of semantic relations. BabelNet also follows the WordNet synonym set (known as synset) and extends it to contain multilingual lexicalizations. Each BabelNet synset represents a given meaning and its synonyms that express that meaning in different languages. Recently, BabelNet has emerged as a wide-coverage and lexical-semantic knowledge resource that integrates heterogeneous resources into a unified framework to facilitate various NLP tasks and applications in a range of different languages [Navigli et al. 2021].
Prior to scheduling a doctor's appointment for diagnosis and treatment of perceived medical issues, people frequently seek healthcare-related information online from health-related news articles, digital health services, and medical question-answering forums. Domain-specific healthcare information usually includes many biomedical named entities, presenting language processing challenges in terms of sense disambiguation for healthcare-related applications. For example, in a sentence “遠離結節並維持身體健康, 平時要維持一個愉悅的心情。” (To prevent nodules and maintain good health, we should usually try to keep a happy mood.), the correct meaning of “結節” (nodules) in the context should be linked to BabelNet synset ID bn:02588649n with the meaning “在醫學上, 結節是小小而堅硬的腫塊, 通常直徑大於1厘米” (In medicine, nodules are small firm lumps, usually greater than 1 cm in diameter.), rather than the other two synset IDs: bn:24656639n “在人類骨骼和至少其他哺乳動物的骨骼中, 結節或骨突是用作骨骼肌附件的突起或隆起。” (In the skeleton of humans and other animals, a tubercle, tuberosity or apophysis is a protrusion or eminence that serves as an attachment for skeletal muscles) and bn:00057856n “植物上的小圓形疣狀突起” (Small rounded wart-like protuberance on a plant.)
In this study, we propose the DGE (Dual Gloss Encoders) model, comprising a context-aware gloss encoder and a lexical gloss encoder, to fully utilize gloss knowledge for biomedical EL. We also construct a language resource called the Chinese HealthWSD corpus, containing 10,218 sentences across 40 biomedical target entities, each of which has been manually annotated with linking senses in BabelNet 5.0, for Chinese WSD in the biomedical domain. Experimental results show that our DGE model significantly outperforms previous methods on our constructed dataset. Further empirical analysis confirms the effectiveness of the proposed model.
The rest of this article is organized as follows. Section 2 reviews studies on WSD models and related language resources. Section 3 provides a detailed description of our proposed DGE model. Section 4 describes experiments used for performance evaluation, including corpus construction, result comparisons, and a series of model analyses. Conclusions are finally drawn in Section 5.

2 Related Work

EL and WSD both address lexical ambiguity. While these two tasks are fairly similar, they differ in terms of lexical item selection [Moro et al. 2014]. EL is a specific type of WSD in which the target is a named entity instead of a general word form. The EL task is to link a named entity occurring in a specific context to its particular sense as defined in a semantic network. Naturally, it belongs to the WSD research area. Hence, we review WSD models and language resources as follows.

2.1 WSD Models

Generally speaking, there are two kinds of WSD approaches: knowledge-based systems and supervised models.
(1)
Knowledge-based systems. Knowledge-based WSD systems rely on lexical-semantic language resources such as WordNet [Miller 1995] and BabelNet [Navigli and Ponzetto 2012]. The Lesk algorithm was the first attempt to use a word meaning, known as a gloss, for WSD [Lesk 1986]. A semantic relatedness measure between concepts was introduced based on the number of overlapping words in their glosses [Banerjee and Pedersen 2003]. The Most Frequent Sense (MFS) baseline was used to assign the most common WordNet sense to each word for WSD [Navigli et al. 2007]. An unsupervised graph-based method was proposed to combine similarity metrics and graph centrality for WSD [Sinha and Mihalcea 2007]. A vector space model was used to combine linguistic features with MeSH (Medical Subject Headings) terms for biomedical WSD [Stevenson et al. 2008]. A graph-based WSD algorithm was introduced to experimentally study graph connectivity for unsupervised WSD [Navigli and Lapata 2010]. An enhanced Lesk WSD algorithm was proposed that uses a word similarity function defined on a distributional semantic space for gloss-context overlap computation [Basile et al. 2014]. Random walks were used to process excessively large lexical knowledge bases for knowledge-based WSD [Agirre et al. 2014]. Babelfy represents a unified graph-based approach to EL and WSD based on a loose identification of candidate meanings coupled with a subgraph heuristic that selects semantic interpretations [Moro et al. 2014]. Definitions from the UMLS (Unified Medical Language System) were combined with word information to build concept embedding representations, and cosine similarity was then used to select the nearest sense of the target word [Tulkens et al. 2016]. A combination of concept embeddings from UMLS and word embeddings was also used to address biomedical WSD with a traditional k-nearest-neighbor (KNN) algorithm [Sabbir et al. 2017]. A topic model was formulated to use the whole document as the context for a word to be disambiguated [Chaplot and Salakhutdinov 2018]. The semantic space and semantic path were respectively modeled by latent semantic analysis and PageRank for knowledge-based WSD [Wang et al. 2020]. Unsupervised methods were used to construct multilingual co-occurrence graphs for biomedical WSD [Duque et al. 2016].
(2)
Supervised models. Neural WSD models feed continuous word representations of a sentence into a neural network for model training and then build a classifier on top of learned features. The neural sequence tagging approach was explored for WSD by leveraging a bidirectional long short-term memory (LSTM) network [Kågebäck and Salomonsson 2016]. A self-attention layer on top of the concatenated bidirectional LSTM was proposed for WSD [Raganato et al. 2017b]. The glosses from WordNet and the semantic relationship between the concept and gloss were incorporated with a bidirectional LSTM network for WSD [Luo et al. 2018b]. A hierarchical co-attention mechanism was introduced to generate co-dependent representations for the context and gloss [Luo et al. 2018a]. The convolutional neural network (CNN) architecture was also used to achieve clinical abbreviation disambiguation [Joopudi et al. 2018]. Contextual features and word embeddings were compared in terms of their feature importance for supervised biomedical WSD [Yepes 2017]. Different strategies were explored to integrate pre-trained contextualized word representations to improve WSD performance [Hadiwinoto et al. 2019]. GlossBERT was proposed to construct context-gloss pairs from all possible senses of the target word in WordNet for WSD as a sentence-pair classification problem [Huang et al. 2019]. A jointly optimized bi-encoder model (BEM) was proposed to improve all-words English WSD [Blevins and Zettlemoyer 2020]. Enhanced WSD Integrating Synset Embeddings and Relations (EWISER) is a neural supervised architecture to incorporate knowledge graph information [Bevilacqua and Navigli 2020]. In addition to gloss definitions, additional information including synonyms, example phrases, and sense gloss of hypernyms were exploited to improve WSD performance [Song et al. 2021]. WSD was also reframed as a span extraction problem and a transformer-based neural architecture was proposed for the new formulation [Barba et al. 2021a]. Continuous Sense Comprehension (ConSeC) followed this task re-framing to regard WSD as a text extraction problem, introducing a feedback loop strategy that allows the disambiguation to be conditioned not only on the target word's context but also on the senses assigned to nearby words [Barba et al. 2021b].
In summary, neural-based supervised systems are now widely used for WSD; therefore, the present research uses pre-trained embeddings as representations and explores DGE based on the transformer architecture for Chinese WSD in the biomedical domain.

2.2 WSD Language Resources

For English WSD, performance is typically evaluated using a unified evaluation framework [Raganato et al. 2017a], which includes two training sets, SemCor [Miller et al. 1993] and OMSTI [Taghipour and Ng 2015], and five test sets from the Senseval/SemEval series [Palmer et al. 2001; Snyder and Palmer 2004; Pradhan et al. 2007; Navigli et al. 2013; Moro and Navigli 2015], all of which were standardized to the same format and the WordNet 3.0 sense inventory. SemCor [Miller et al. 1993] is the largest human-constructed WSD resource, containing 226,036 annotated examples covering 33,362 separate senses. Among the five test sets, the smallest, the SemEval-2007 task 17 dataset [Pradhan et al. 2007] containing 135 sentences, is usually used as the development set.
Relatively little research has focused on Chinese WSD and standard benchmark data is lacking. In Senseval-3 task 5, the entries contained 20 different Chinese words annotated with HowNet senses [Litkowski 2004]. In SemEval-07 task 5 [Jin et al. 2007], a total of 40 Chinese ambiguous words, including 19 nouns and 21 verbs, were selected for WSD system development and evaluation. Each sense of a target word was annotated using the Chinese Semantic Dictionary (CSD), with 2,686 training instances and 935 test instances. Chinese OntoNotes Release 5.0 (LDC2013T19) containing word sense annotations for Chinese was used to evaluate WSD across four genres, including broadcast conversation, broadcast news, magazines, and newswires [Hadiwinoto et al. 2019].
Only two previous studies have focused on biomedical WSD data construction. The NLM-WSD dataset [Weeber et al. 2001] contains 5,000 manually created instances covering 50 ambiguous UMLS Metathesaurus strings from 1998 MEDLINE citations. The more recent MSH-WSD test collection [Jimeno-Yepes et al. 2011] was automatically created to cover 37,090 instances of ambiguous terms from 2010 MEDLINE citations using the UMLS Metathesaurus and manual MeSH indexing of MEDLINE.
To the best of our knowledge, there are no publicly available datasets designed for biomedical WSD in the Chinese language. In this article, we construct a language resource called the Chinese HealthWSD corpus, containing 10,218 sentences across 40 target entities, each manually annotated with linking senses in BabelNet 5.0. This study represents the first effort to produce such a dataset for Chinese WSD in the biomedical domain, and the corpus will be released as a language resource for further research.

3 DGE

In Chinese WSD, a sentence is represented as a sequence of characters denoted as \({c}_1{c}_2 \cdots {c}_i \cdots {c}_j \cdots {c}_n\), in which a target word consists of one or more contiguous characters, say from \({c}_i\) to \({c}_j\) where \(j \ge i\). The word to be disambiguated has a set of candidate senses \(G:\{ {{g}_1,{g}_2,\ldots,{g}_k,\ldots,{g}_m} \}\) in a pre-defined sense inventory. Therefore, given a word w and its context in terms of a sentence s, a WSD system is a function \(f( {w,s} ) = g\) subject to \(g \in G\), which aims to find the most suitable gloss g for the target w in sentence s.
We propose a WSD model for Chinese biomedical EL. The overall architecture is shown in Figure 1. Our proposed DGE model consists of two independent encoders: (1) a context-aware gloss encoder, which represents the surrounding text of the target entity along with the definition text of each word sense; and (2) a lexical gloss encoder, which represents the definition text of each word sense. Both encoders are transformer networks initialized with BERT [Devlin et al. 2019] that embed each token of the surrounding text as the context and each definition text as a gloss. We then score each candidate gloss for a target entity by taking the dot product of the two encoder representations. The gloss with the highest score is regarded as the predicted word sense.
Fig. 1. Our proposed model architecture for Chinese biomedical entity linking.

3.1 Context-aware Gloss Encoder

Our encoder, based on the BERT architecture [Devlin et al. 2019], independently encodes target entities with their contexts and glosses [Blevins and Zettlemoyer 2020; Humeau et al. 2020]. The embedding representation of the context-aware gloss encoder consists of the following three parts:
(1)
Token Embedding. The sentence containing the target entity is denoted as the context sentence. We obtain all possible gloss sentences for each target entity from the sense inventory BabelNet 5.0. Each context-gloss pair is constructed using the BERT-specific [CLS] and [SEP] tokens. The input token sequence \({X}_k\) is defined as follows: \([ {CLS} ]{c}_1{c}_2 \cdots \# {c}_i \cdots {c}_j\# \cdots {c}_n[ {SEP} ]{g}_{k1}{g}_{k2} \cdots [ {SEP} ]\), where the weak supervision signal “#” is used to emphasize the target entity in the context sentence [Huang et al. 2019], and the kth gloss of the target entity is tokenized as \({g}_{k1}{g}_{k2} \cdots\). As in BERT, the first token is the [CLS] classification tag, and the context sentence and gloss sentence are separated by the [SEP] token.
(2)
Segment Embedding. Different from traditional BERT, which uses segment embeddings to distinguish sentences, we use the same segment tag “1” for all tokens, since the context sentence contains the target entity and each gloss sentence is the definition text of that entity. Naturally, all tokens implicitly describe the same target entity.
(3)
Position Embedding. We use mask-self-attention [Liu et al. 2020] to reflect the soft-position information of each input context-gloss sentence pair. We mask the position information of all tokens with the tag “-1”, except for the tokens belonging to the entity to be disambiguated, to emphasize the target in the context sentence.
The encoder layers then produce a sequence of embedding representations. We define the representation of the target word, \({r}_{w,k}\), as shown in Equation (1), where the target word w is tokenized into multiple characters spanning the ith to jth tokens of the context-gloss pair \({X}_k\), and \({E}_C\) denotes the average of the corresponding values over all encoder layers.
\begin{equation} {r}_{w,k} = \frac{1}{j - i + 1}\sum_{t = i}^{j} {E}_C\left( {X}_k \right)\left[ t \right]. \end{equation}
(1)
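To make the construction of \({X}_k\) and Equation (1) concrete, the following is a minimal sketch of the context-aware gloss encoder, assuming the HuggingFace transformers library and the hfl/chinese-bert-wwm-ext checkpoint; the function names (build_context_gloss_pair, target_representation) are illustrative rather than the authors' code, and the soft-position masking of item (3) is omitted for brevity.

import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-bert-wwm-ext")
encoder = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext", output_hidden_states=True)

def build_context_gloss_pair(context: str, start: int, end: int) -> str:
    # Insert the weak supervision marker '#' around the target entity span,
    # given as character offsets [start, end) in the context sentence.
    return context[:start] + "#" + context[start:end] + "#" + context[end:]

def target_representation(context: str, start: int, end: int, gloss: str) -> torch.Tensor:
    marked = build_context_gloss_pair(context, start, end)
    enc = tokenizer(marked, gloss, return_tensors="pt")
    # The paper uses a single segment tag "1" for every token, so overwrite
    # the default 0/1 segment ids produced by the tokenizer.
    enc["token_type_ids"] = torch.ones_like(enc["token_type_ids"])
    out = encoder(**enc)
    # Equation (1) averages over all encoder layers, not just the last one.
    layers = torch.stack(out.hidden_states[1:], dim=0).mean(dim=0)   # (1, seq_len, dim)
    # Collect the token indices covering the target entity (+1 skips the leading '#').
    span_ids = sorted({t for ci in range(start + 1, end + 1)
                       if (t := enc.char_to_token(0, ci, sequence_index=0)) is not None})
    # Mean over the target-entity tokens gives r_{w,k}.
    return layers[0, span_ids].mean(dim=0)                           # (dim,)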

3.2 Lexical Gloss Encoder

The standard BERT architecture is used to train the lexical gloss encoder. The kth lexical gloss of the target entity, \({g}_k\), is tokenized as a sequence \({g}_{k1}{g}_{k2} \cdots {g}_{kn}\), where n is the number of tokens in the gloss. We define the lexical gloss encoder as \({T}_g\) and take the representation of the kth lexical gloss \({r}_k\) as follows:
\begin{equation} {r}_k = {T}_g\left( {{g}_k} \right)\left[ 0 \right], \end{equation}
(2)
where we take the final output representation corresponding to the first input [CLS] token (index 0 in the token list) from the lexical gloss encoder as a global representation of \({g}_k\).
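A minimal sketch of the lexical gloss encoder in Equation (2), again assuming the HuggingFace transformers library and the same illustrative checkpoint: the gloss text is encoded on its own, and the output vector at index 0 (the [CLS] position) is taken as the global gloss representation \({r}_k\).

import torch
from transformers import BertTokenizerFast, BertModel

gloss_tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-bert-wwm-ext")
gloss_encoder = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

def gloss_representation(gloss: str) -> torch.Tensor:
    enc = gloss_tokenizer(gloss, return_tensors="pt")
    out = gloss_encoder(**enc)
    # r_k = T_g(g_k)[0]: the final-layer vector at the [CLS] position.
    return out.last_hidden_state[0, 0]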

3.3 Scoring

We then score each candidate sense \({g}_k \in G\) for a target entity w by taking the dot product of \({r}_{w,k}\) with the corresponding \({r}_k\) as follows:
\begin{equation} \phi \left( {w,{g}_k} \right) = {r}_{w,k} \cdot {r}_k. \end{equation}
(3)
We use a cross-entropy loss over the scores of the candidate senses of the target word to train our proposed DGE model. The loss for a word w paired with its correct sense \({g}_k\) is as follows:
\begin{equation} L\left( {w,{g}_k} \right) = - \phi \left( {w,{g}_k} \right) + \log \sum_{k' = 1}^{m} \exp \left( \phi \left( {w,{g}_{k'}} \right) \right). \end{equation}
(4)
During the inference phase, we predict the sense \(\hat{g}\) of the target word w to be the sense \({g}_k \in G\) whose representation \({r}_k\) has the highest dot product score with \({r}_{w,k}\).
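Putting the two encoders together, the following hedged sketch illustrates the scoring of Equation (3), the loss of Equation (4), and argmax inference; target_representation and gloss_representation refer to the illustrative functions sketched in Sections 3.1 and 3.2, and in an actual implementation both encoders would be optimized jointly on these scores.

import torch
import torch.nn.functional as F

def disambiguate(context: str, start: int, end: int, glosses: list[str]):
    # Score every candidate sense by the dot product of Equation (3)
    # and return the scores together with the argmax prediction.
    scores = torch.stack([
        torch.dot(target_representation(context, start, end, g),   # r_{w,k}
                  gloss_representation(g))                          # r_k
        for g in glosses
    ])
    return scores, int(scores.argmax())

def dge_loss(scores: torch.Tensor, gold_index: int) -> torch.Tensor:
    # Equation (4): -phi(w, g_gold) + log sum_k' exp(phi(w, g_k')),
    # i.e., cross-entropy over the candidate senses of the target entity.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))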
Table 1 shows an example with a target entity to be disambiguated by our DGE model. Consider the sentence “腺瘤性息肉癌變機率較高, 建議即時切除。” (Adenomatous polyps have a higher probability of becoming cancerous; prompt excision is recommended.) with the target entity “息肉” (polyp) to be disambiguated. There are two gloss definitions of polyp in BabelNet 5.0. During DGE model inference, the first gloss “息肉是突出於黏膜的組織異常生長, 一般定義是長於中空器官的腔壁黏膜上。” (A small vascular growth on the surface of a mucous membrane.), which receives the highest score, is clearly returned as the system output for biomedical EL in this context sentence, rather than the other gloss “共同服用的兩種形式中的一種 (例如, Hydra 或珊瑚): 通常用空心圓柱體的沉重, 通常在嘴周圍有一個觸手環。” (One of two forms taken by coelenterates (e.g., a hydra or coral): usually sedentary with a hollow cylindrical body usually with a ring of tentacles around the mouth.).
Sentence: 腺瘤性息肉癌變機率較高, 建議即時切除。(Adenomatous polyps have a higher probability of becoming cancerous and prompt excision is recommended.)
Target Entity: 息肉 (polyp)
Context-Aware Gloss Encoder
X1: [CLS] 腺瘤性#息肉#癌變機率較高…… [SEP] 息肉是突出於黏膜的組織… [SEP]
X2: [CLS] 腺瘤性#息肉#癌變機率較高…… [SEP] 共同服用的兩種形式中的… [SEP]
Lexical Gloss Encoder
g1: [CLS] 息肉是突出於黏膜的組織異常生長… [SEP]
(A small vascular growth on the surface of a mucous membrane.)
g2: [CLS] 共同服用的兩種形式中的一種 (例如, Hydra或珊瑚) … [SEP]
(One of two forms that coelenterates take (e.g. a hydra or coral) …)
Dot product Score
\({X}_1 \cdot {g}_1\) 58.60 (the highest score: a correct sense g1)
\({X}_1 \cdot {g}_2\) 26.08
\({X}_2 \cdot {g}_1\) 57.87 (the second-highest score; wrong context-gloss pair, but correct gloss)
\({X}_2 \cdot {g}_2\) 24.67
Table 1. An Illustrated Example for DGE Model

4 Experiments for Performance Evaluation

4.1 Datasets

Due to the lack of publicly available datasets, we built a corpus for WSD in the biomedical domain. We first selected named entities in the Chinese HealthNER corpus [Lee and Lu 2021], which covers 10 entity types (body, symptom, instrument, examinations, chemical, disease, drug, supplement, treatment, and time), as seed entities to search BabelNet 5.0, a multilingual encyclopedic lexicon that contains named entities in a large network of semantic relations. A total of 735 distinct named entities have at least two semantically different glosses in BabelNet. After manual checking, we selected 40 nouns as target biomedical entities, excluding those with unclear or overly specific glosses (e.g., names of creative works or authors).
We used these 40 distinct words as seed query terms to search for and retrieve sentences containing the target named entities. To efficiently and accurately crawl sentences containing ambiguous target words with different glosses, we manually paired gloss-related keywords with the target words as queries, which were automatically submitted to the Google search engine to obtain appropriate search results. We then crawled the corresponding web pages, removed all HTML tags, images, videos, and embedded web advertisements, split the remaining texts into sentences, and retained sentences containing at least one target entity word for manual annotation.
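As an illustration only (not the authors' actual pipeline), the following sketch shows the cleaning and filtering step described above: stripping HTML, splitting the remaining text into sentences on Chinese sentence-final punctuation, and retaining sentences that mention a target entity. BeautifulSoup is an assumed dependency, and the entity set shown is a small subset of the 40 seed entities.

import re
from bs4 import BeautifulSoup

TARGET_ENTITIES = {"息肉", "結節", "卵巢"}   # illustrative subset of the 40 target entities

def candidate_sentences(html: str) -> list[str]:
    # Drop scripts, styles, and embedded media, then keep the visible text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "img", "video", "iframe"]):
        tag.decompose()
    text = soup.get_text(separator="")
    # Split after Chinese full stops, question marks, and exclamation marks.
    sentences = [s.strip() for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    # Retain only sentences containing at least one target entity word.
    return [s for s in sentences if any(entity in s for entity in TARGET_ENTITIES)]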
Three graduate students from an electrical engineering and biomedical research group were trained in word sense tagging. Each sentence was annotated by all three annotators, who selected an appropriate gloss for the target entity. The annotators were asked to discuss differences and seek consensus. If no consensus could be reached after discussion, those sentences were excluded from our constructed data.
Through this process, we created the Chinese HealthWSD corpus, which can be used for EL research in the biomedical domain. Table 2 shows descriptive statistics. Most target entities have two glosses (29 target words, or 72.5% of the total), followed by three glosses (seven words, 17.5%) and four glosses (four words, 10%). In total, we have 10,218 sentences containing biomedical named entities with annotated glosses in BabelNet 5.0. Each sentence contains an average of 52.49 tokens (characters and punctuation marks). Compared with the previous Chinese WSD data in SemEval-2007 task 5 [Jin et al. 2007], which contained 3,621 sentences across 40 target words, our constructed Chinese HealthWSD corpus has 2.8 times as many sentences across 40 distinct biomedical entities and is well suited for Chinese WSD research.
2 glosses (29 entities, 58 glosses, 7,409 sentences, 393,726 tokens): 前臂 (forearm)、皮毛 (coat)、*部位 (component)、黏液 (mucus)、心臟病 (heart disease)、乳管 (lactiferous duct)、薄膜 (biological membrane)、鼓膜 (eardrum)、胚囊 (gestational sac)、分泌物 (exudate)、組織 (tissue)、手足 (hands and feet)、多巴胺 (dopamine)、卵巢 (ovary)、超音波 (ultrasound)、顯影劑 (contrast medium)、息肉 (polyp)、炭疽病 (anthrax)、白斑 (vitiligo)、*緊身衣 (skin-tight garment)、*檢測 (assay)、*鏡頭 (camera lens)、呼吸管 (snorkel)、牙套 (gumshield)、神經衰弱 (neurasthenia)、閉鎖 (atresia)、過敏反應 (allergy)、脂肪 (adipose tissue)、石膏 (gypsum)
3 glosses (7 entities, 21 glosses, 1,780 sentences, 92,676 tokens): 眨眼 (blink)、結節 (nodule)、黑眼圈 (black eye)、乳房 (udder)、*香料 (spice)、穿刺 (centesis)、*手套 (glove)
4 glosses (4 entities, 16 glosses, 1,029 sentences, 49,931 tokens): 隔膜 (diaphragm)、眼睛 (eye)、導管 (catheter)、眼罩 (blindfold)
Total: 40 distinct entities, 95 glosses, 10,218 sentences, 536,333 tokens
Table 2. Statistics of Chinese HealthWSD Corpus
“*” denotes that the target word does not have a biomedical gloss in BabelNet 5.0.

4.2 Settings

The hold-out validation approach was used to tune the hyper-parameters and evaluate model performance. Table 3 presents descriptive statistics for the mutually exclusive datasets: 70% of our constructed corpus was randomly selected for model training, 10% was used for hyper-parameter tuning, and the remaining 20% was used for performance evaluation. Based on suggestions from related studies [Blevins and Zettlemoyer 2020], the hyper-parameter values for our DGE model implementation were set as follows: number of training epochs 20, batch size 32, learning rate 0.00001, and maximum sequence length 256. In addition, Chinese-bert-wwm-ext [Cui et al. 2021] was used for the pre-trained embedding vectors.
Datasets      #Sent     #Token
Training      7,109     373,152
Validation    979       50,320
Test          2,130     112,861
All           10,218    536,333
Table 3. Experimental Dataset Statistics
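A minimal sketch of the hold-out protocol and the hyper-parameter values reported above; the random seed and variable names are assumptions made for illustration.

import random

def split_corpus(sentences, seed=42):
    # Random 70% / 10% / 20% split into training, validation, and test sets.
    random.seed(seed)
    shuffled = sentences[:]
    random.shuffle(shuffled)
    n = len(shuffled)
    return (shuffled[:int(0.7 * n)],              # training
            shuffled[int(0.7 * n):int(0.8 * n)],  # validation
            shuffled[int(0.8 * n):])              # test

HYPERPARAMS = {
    "epochs": 20,
    "batch_size": 32,
    "learning_rate": 1e-5,          # 0.00001
    "max_seq_length": 256,
    "pretrained_model": "hfl/chinese-bert-wwm-ext",   # Chinese-bert-wwm-ext [Cui et al. 2021]
}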
For evaluation, we used the standard WSD evaluation metrics of precision, recall, and F1-score. Precision is the ratio of biomedical entities assigned correct glosses by the WSD system to the total number of entities the system identified. Recall is the ratio of biomedical entities assigned correct glosses by the WSD system to all biomedical entities in the sentences. The F1-score is the harmonic mean of precision and recall. We used the entity-linking evaluation program released with SemEval-2015 task 13 [Moro and Navigli 2015] to measure system performance. Under this metric definition, precision and recall are identical for WSD, so previous WSD studies report only the F1-score.
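Because the system assigns a gloss to every target entity, the number of attempted instances equals the total number of instances, which is why precision and recall coincide; a small sketch of the metric computation under that assumption:

def wsd_scores(num_correct: int, num_attempted: int, num_total: int):
    # Standard WSD evaluation metrics: precision, recall, and their harmonic mean.
    precision = num_correct / num_attempted
    recall = num_correct / num_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1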

4.3 Results

The following three models were compared to demonstrate their performance for Chinese biomedical EL.
(1)
GlossBERT [Huang et al. 2019]. GlossBERT converted the WSD task into a sentence-pair classification task. Sentence-gloss pairs, constructed with a weak supervision mechanism to emphasize the target word in the sentence, were fed into the BERT model for gloss classification.
(2)
BERT-wsd [Hadiwinoto et al. 2019]. BERT-wsd proposed different strategies for integrating pre-trained contextualized word representations, including linear projection of hidden layers and nearest neighbor matching, to improve BERT performance on the WSD task.
(3)
BEM [Blevins and Zettlemoyer 2020]. BEM used a bi-encoder model to separately consider the embeddings of the target word in the sentence and the corresponding glosses to be disambiguated. The encoders are jointly optimized to assign the nearest gloss embedding for each target word.
Table 4 compares the results of various EL models on our constructed Chinese HealthWSD dataset. The F1-score obtained by our proposed DGE model showed significant differences from all other models (p-value < 0.05). GlossBERT [Huang et al. 2019] took the pairing of a sentence and a gloss of the target word as its input and considered the embedding of the whole pair to predict which glosses were correct. BERT-wsd [Hadiwinoto et al. 2019] identified the correct gloss based on the target word embedding only. Both models confirmed the effectiveness of the BERT model and its embeddings. BEM [Blevins and Zettlemoyer 2020] considered not only the embedding of the target word but also the glosses of the target word in a bi-encoder architecture, achieving the second-best F1-score of 96.78. This obvious performance enhancement reflects the usefulness of the bi-encoder architecture.
Model                                  F1-score (%)
GlossBERT [Huang et al. 2019]          81.36
BERT-wsd [Hadiwinoto et al. 2019]      81.56
BEM [Blevins and Zettlemoyer 2020]     96.78
DGE (ours)                             97.81
Table 4. Experimental Results on Chinese HealthWSD Datasets
Our proposed DGE method used gloss knowledge and a weak supervision mechanism to direct the attention mechanism toward the target word, providing the best F1-score of 97.81 and slightly outperforming the BEM model with a statistically significant difference (p-value = 0.0339). In summary, our DGE model is an effective solution for the Chinese EL task in the biomedical domain.

4.4 In-depth Analysis

We further discuss the findings of the proposed DGE model in the following aspects.
(1)
Ablation study.
We conducted an ablation study of the context-aware gloss encoder in our proposed DGE model by removing the weakly supervised mechanism, selecting a different output embedding vector (the target entity versus the [CLS] token), and using different representation values (the average of all encoder layers versus the last layer only).
Table 5 shows the performance comparison. We first removed the weakly supervised mechanism (marked as “– weak”). This may cause the attention mechanism to lose focus on the target entity, thus degrading performance. We further replaced the output embedding of the target entity with that of the special classification [CLS] token (marked as “– weak, target → [CLS]”). The context-aware gloss encoder suffered a larger performance loss, indicating that the embedding vector of the target entity is an essential clue that provides a more specific representation than the [CLS] token, which usually represents the meaning of the whole sentence. We also replaced the embedding values averaged over all encoder layers with those of the last layer only (marked as “– weak, average → last”). This significantly reduced performance, indicating that, despite the widespread use of the last layer as the embedding vector, averaging all layers achieves better results.
DGE Model                                                  F1-score
Context-aware Gloss Encoder (weak + target + average)      97.81
– weak                                                     97.34
– weak, target → [CLS]                                     93.13
– weak, average → last                                     96.89
Table 5. Ablation Study for Context-Aware Encoder
(2)
Embedding effects.
To understand the effect of different embedding methods on the representations of the target entity and its corresponding glosses, we compared BERT [Cui et al. 2021] with RoBERTa [Liu et al. 2019] and MacBERT [Cui et al. 2020], with results summarized in Table 6. The BERT embedding achieved the best result, although the differences are small because all three models share a similar BERT-like architecture for embedding representation.
Word Embeddings                                F1-score
Chinese-bert-wwm-ext [Cui et al. 2021]         97.81
Chinese-roberta-wwm-ext [Liu et al. 2019]      97.69
Chinese-macbert-base [Cui et al. 2020]         97.58
Table 6. Performance Comparison of Different Embedding Methods
(3)
Domain-specific gloss effects.
Among the 40 named entities in our constructed Chinese HealthWSD corpus, six target entities have no biomedical gloss in BabelNet 5.0. We further verified the effect of the presence or absence of domain-specific glosses, with results shown in Table 7. Entities with biomedical glosses achieved a slightly better F1-score, because a relatively larger proportion of domain-specific words in a context sentence benefits domain-specific gloss selection.
Biomedical Gloss    Number of Target Entities    F1-score
Yes                 34                           97.87
No                  6                            97.43
Table 7. Performance Comparison of Domain-Specific Gloss Effects
(4)
Gloss number effects.
We analyzed the biomedical entities with different numbers of glosses. Table 8 shows the performance comparison. Entities with two glosses were the easiest to disambiguate, achieving the best F1-score of 98.38. However, entities with four glosses outperformed those with three glosses, indicating that the number of glosses is not directly related to WSD system performance. Some biomedical entities with three glosses, such as “黑眼圈” (black eye or dark circles) and “眨眼” (blink or jiffy), have candidate senses with highly similar semantics, making it difficult to identify the correct gloss even with manual annotation.
Gloss Number    Number of Target Words    F1-score
2 glosses       29                        98.38
3 glosses       7                         95.73
4 glosses       4                         97.30
Table 8. Performance Comparison of Gloss Number Effects

4.5 Error Analysis

Although our DGE model achieved promising results and outperformed previous work, we conducted additional analysis to identify the root causes of its errors. A total of 47 error cases were divided into three error types as follows.
(1)
Insufficient context (13 error cases, accounting for 27.66%).
A sentence may not provide sufficient context for WSD. For example, in the sentence “此數值越高, 表示卵巢功能越差。” (The higher the value is, the worse the ovaries are functioning.), the correct meaning of “卵巢” (ovaries) should be BabelNet ID bn:00059826n “卵巢在解剖學中是指動物雌性體內製造卵子的一對性腺體。” (One of usually two organs that produce ova and secrete estrogen and progesterone.), instead of bn:00059825n “子房是被子植物生長種子的器官, 位置不定, 依和雌蕊的相對位置可分為上位花、下位花或是周位花。” (The organ for growing seeds in angiosperms; a hollow structure located below the style of the female floral organ and generally slightly swollen.). Clearly, the sentence did not provide enough context, making it relatively difficult for the WSD system to determine the correct gloss.
(2)
Highly complicated and implicit semantics (17 cases/36.17%).
Some sentences contain implicit semantics, such as the sentence “間不容瞬是指眨眼的時間都沒有。” (“間不容瞬” (jiffy) means there is not even time to blink.), in which the correct meaning of “眨眼” (blink) should be linked to BabelNet ID bn:00011275n with the meaning “眨眼是眼瞼開合的動作。” (A reflex that closes and opens the eyes rapidly.), rather than bn:00011276n “很短的時間 (如眨眼或心跳所需的時間) 。” (A very short time (as the time it takes the eye to blink or the heart to beat).). The human blink time in this sentence is used to express a very short time, so “眨眼” (blink) here should refer to the reflex of the eyes closing and opening rapidly.
(3)
Unknown errors (17 cases/36.17%).
Some error cases cannot be categorized as either of the above two error types. For example, for “來自愛爾蘭的海藻精華富含天然的多種營養成分, 可增進生長因子IGF-1, 有助皮毛生長、減少掉毛……。” (Seaweed extract from Ireland is rich in natural multi-nutrients and boosts the growth factor IGF-1, which can help the coat grow and reduce shedding…), the correct gloss should be BabelNet ID bn:00020186n “覆蓋動物身體的頭髮、羊毛或毛皮的生長。” (Growth of hair or wool or fur covering the body of an animal), instead of BabelNet ID bn:00072299n “對一個主題的輕微或膚淺的了解。” (A slight or superficial understanding of a subject.) This kind of sentence contains enough context and doesn't present implicit semantics, but it's difficult to know exactly what caused the wrong gloss selection.

4.6 Discussion

A bi-encoder architecture called BEM [Blevins and Zettlemoyer 2020] has been used to independently exploit the context sentence and gloss text for optimized performance in the same representation space, achieving promising results for English WSD. Following the success of this work, we designed a bi-encoder method named DGE to enhance the performance for Chinese WSD. In experiments, BEM showed clear performance improvements compared with single encoder methods (i.e., GlossBERT [Huang et al. 2019] and BERT-wsd [Hadiwinoto et al. 2019]) and our DGE model provides additional performance enhancement, confirming the effectiveness of the bi-encoder architecture.
Our DGE method consists of a context-aware gloss encoder and a lexical gloss encoder. The lexical gloss encoder is designed to represent the gloss itself without additional texts, so this encoder cannot be used alone to perform the WSD task. The context-aware gloss encoder contains a context sentence and the target entity to be disambiguated. Using the context-aware gloss encoder alone is very close to GlossBERT [Huang et al. 2019], which constructs context-gloss pairs from glosses of all possible senses of the target word to treat the WSD task as a sentence-pair classification.
The input to both encoders in our proposed model is a sequence of characters without domain-specific token injection. Therefore, our DGE model can be directly applied to open-domain texts without additional limitations. In addition, our DGE model is language-independent and may be used for any text written left-to-right.

5 Conclusions

We propose a DGE model for Chinese biomedical EL, making the following contributions:
(1)
We present a dual encoder architecture based on gloss knowledge for target entity disambiguation.
The encoders separately model the gloss knowledge: a context-aware gloss encoder with a weakly supervised mechanism and a lexical gloss encoder containing sense texts for contextualized embedding representations. The dual encoders are then jointly optimized to assign the nearest gloss (the one with the highest score) for WSD. Our DGE model achieved an F1-score of 97.81, significantly outperforming existing models for Chinese biomedical EL.
(2)
We construct a Chinese WSD dataset for word sense disambiguation and EL in the biomedical domain.
To the best of our knowledge, our constructed dataset is the first Chinese biomedical WSD corpus. It includes a total of 10,218 sentences with around 0.53 million characters. After manual annotation, we have 95 glosses found in BabelNet 5.0 across 40 distinct biomedical entities. We plan to release our constructed Chinese HealthWSD corpus as a language resource for further research.

References

[1]
E. Agirre, O. López de Lacalle, and A. Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics 40, 1 (2014), 57–84.
[2]
S. Banerjee and T. Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on Artificial Intelligence. 805–810.
[3]
E. Barba, T. Pasini, and R. Navigli. 2021a. ESC: Redesigning WSD with extractive sense comprehension. In Proceedings of the 2021 Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4661–4672.
[4]
E. Barba, L. Procopio, and R. Navigli. 2021b. ConSec: Word sense disambiguation as continuous sense comprehension. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 1492–1503.
[5]
P. Basile, A. Caputo, and G. Semeraro. 2014. An enhanced Lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers. 1591–1600.
[6]
M. Bevilacqua and R. Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2854–2864.
[7]
T. Blevins and L. Zettlemoyer. 2020. Moving down the long tail of word sense disambiguation with gloss informed bi-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1006–1017.
[8]
D. S. Chaplot and R. Salakhutdinov. 2018. Knowledge-based word sense disambiguation using topic models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 5062–5069.
[9]
Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. Findings of the Association for Computational Linguistics: EMNLP. 657–668.
[10]
Y. Cui, W. Che, T. Liu, B. Qin, and Z. Yang. 2021. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3504–3514.
[11]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
[12]
A. Duque, J. Martinez-Romo, and L. Araujo. 2016. Can multilinguality improve biomedical word sense disambiguation? Journal of Biomedical Informatics 64 (2016), 320–332.
[13]
C. Hadiwinoto, H. T. Ng, and W. C. Gan. 2019. Improved word sense disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 5297–5306.
[14]
L. Huang, C. Sun, X. Qiu, and X. Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 3509–3514.
[15]
S. Humeau, K. Shuster, M.-A. Lachaux, and J. Weston. 2020. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proceedings of the 8th International Conference on Learning Representations. 1–14.
[16]
A. J. Jimeno-Yepes, B. T. McLnnes, and A. R. Aronson. 2011. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics 12, 223 (2011), 1–14.
[17]
P. Jin, Y. Wu, and S. Yu. 2007. SemEval-2007 task 5: Multilingual Chinese-English lexical sample. In Proceedings of the 4th International Workshop on Semantic Evaluations. 19–23.
[18]
V. Joopudi, B. Dandala, and M. Devarakonda. 2018. A convolutional route to abbreviation disambiguation in clinical text. Journal of Biomedical Informatics 86 (2018), 71–78.
[19]
M. Kågebäck and H. Salomonsson. 2016. Word sense disambiguation using a bidirectional LSTM. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon. 51–56.
[20]
L. H. Lee and Y. Lu. 2021. Multiple embeddings enhanced multi-graph neural networks for Chinese healthcare named entity recognition. IEEE Journal of Biomedical and Health Informatics 25, 7 (2021), 2801–2810.
[21]
M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation. 24–26.
[22]
K. Litkowski. 2004. Senseval-3 task: word sense disambiguation of WordNet glosses. In Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. 13–16.
[23]
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
[24]
W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2901–2908.
[25]
F. Luo, T. Liu, Z. He, Q. Xia, Z. Sui, and B. Chang. 2018a. Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1402–1411.
[26]
F. Luo, T. Liu, Q. Xia, B. Chang, and Z. Sui. 2018b. Incorporating glosses into neural word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2473–2482
[27]
G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38, 11 (1995), 39–41.
[28]
G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993. Using a semantic concordance for sense identification. In Proceedings of the Human Language Technology Workshop. 21–24.
[29]
A. Moro and R. Navigli. 2015. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation. 288–297.
[30]
A. Moro, A. Raganato, and R. Navigli. 2014. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics 2 (2014), 231–244.
[31]
R. Navigli, M. Bevilacqua, S. Conia, D. Montagnini, and F. Cecconi. 2021. Ten years of BabelNet: a survey. In Proceedings of the 30th International Joint Conference on Artificial Intelligence. 4559–4567.
[32]
R. Navigli, D. Jurgens, and D. Vannella. 2013. Semeval-2013 task 12: Multilingual word sense disambiguation. In Proceedings of the 7th International Workshop on Semantic Evaluation. 222–231.
[33]
R. Navigli and M. Lapata. 2010. An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 4 (2010), 678–692.
[34]
R. Navigli, K. C. Litkowski, and O. Hargraves. 2007. SemEval-2007 task 07: Coarse-grained English all-words task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval ’07). 30–35.
[35]
R. Navigli and S. Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193 (2012), 217–250.
[36]
M. Palmer, C. Fellbaum, S. Cotton, L. Delfs, and H. T. Dang. 2001. English tasks: all-words and verb lexical sample. In Proceedings of the 2nd International Workshop on Evaluating Word Sense Disambiguation Systems. 21–24.
[37]
S. Pradhan, E. Loper, D. Dligach, and M. Palmer. 2007. SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations. 87–92.
[38]
A. Raganato, C. D. Bovi, and R. Navigli. 2017b. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1156–1167.
[39]
A. Raganato, J. Camacho-Collados, and R. Navigli. 2017a. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 99–110.
[40]
A. Sabbir, A. Jimeno-Yepes, and R. Kavuluru. 2017. Knowledge-based biomedical word sense disambiguation with neural concept embeddings. In Proceedings of the 17th IEEE International Conference on Bioinformatics Bioengineering. 163–170.
[41]
R. Sinha and R. Mihalcea. 2007. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proceedings of the IEEE International Conference on Semantic Computing. 363–369.
[42]
B. Snyder and M. Palmer. 2004. The English all-word task. In Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. 41–43.
[43]
Y. Song, X. C. Ong, H. T. Ng, and Q. Lin. 2021. Improved word sense disambiguation with enhanced sense representations. Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, 4311–4320.
[44]
M. Stevenson, Y. Guo, R. Gaizauskas, and D. Martinez. 2008. Knowledge sources for word sense disambiguation of biomedical text. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. 80–87.
[45]
K. Taghipour and H. T. Ng. 2015. One million sense-tagged instances for word sense disambiguation and induction. In Proceedings of the 19th Conference on Computational Natural Language Processing. 338–344.
[46]
S. Tulkens, S. Suster, and W. Daelemans. 2016. Using distributed representations to disambiguate biomedical and clinical concepts. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing. 77–82.
[47]
Y. Wang, M. Wang, and H. Fujita. 2020. Word sense disambiguation: a comprehensive knowledge exploitation framework. Knowledge-Based Systems 190 (2020), 105030.
[48]
M. Weeber, J. G. Mork, and A. R. Aronson. 2001. Developing a test collection for biomedical word sense disambiguation. In Proceedings of the AMIA 2001 Annual Symposium. 746–750.
[49]
A. J. Yepes. 2017. Word embeddings and recurrent neural networks based on Long-Short Term Memory nodes in supervised biomedical word sense disambiguation. Journal of Biomedical Informatics 73 (2017), 137–147.
