
Leveraging Dual Gloss Encoders in Chinese Biomedical Entity Linking

Published: 08 February 2024

Abstract

Entity linking is the task of assigning a unique identity to named entities mentioned in a text, a form of word sense disambiguation that focuses on automatically determining a pre-defined sense for a target entity to be disambiguated. This study proposes the DGE (Dual Gloss Encoders) model for Chinese entity linking in the biomedical domain. We separately model a dual encoder architecture, comprising a context-aware gloss encoder and a lexical gloss encoder, for contextualized embedding representations. The dual encoders are then jointly optimized to assign the nearest gloss (i.e., the one with the highest score) for target entity disambiguation. The experimental datasets consist of a total of 10,218 sentences that were manually annotated with glosses defined in BabelNet 5.0 across 40 distinct biomedical entities. Experimental results show that the DGE model achieved an F1-score of 97.81, outperforming other existing methods. A series of model analyses indicate that the proposed approach is effective for Chinese biomedical entity linking.

1 Introduction

Word Sense Disambiguation (WSD) is a fundamental and long-standing task in natural language processing (NLP) that automatically assigns a pre-defined sense to a target word in a text. Identifying the correct word sense is important for natural language understanding because different senses reflect different meanings of a target word in a specific context determined by other words in the sentence. Entity linking (EL) assigns a unique identity to named entities mentioned in a text, a sort of WSD, especially when the target word is the concerned named entity.
Definitions of senses are usually called glosses, and have been shown to be valuable for improving WSD performance [Lesk 1986; Banerjee and Pedersen 2003; Basile et al. 2014; Luo et al. 2018a; 2018b]. The gloss information of the existing language resources for WSD is usually retrieved from WordNet [Miller 1995], an English language lexical database, providing an inventory for WSD systems to match a suitable sense of a lemma from lexicographic perspectives [Moro et al. 2014]. However, texts containing a named entity for EL may or may not exactly match the word form found in the lexical database. Compared with lexicon-based resources for EL, encyclopedia-based semantic networks that provide richer and more fine-grained information would be well-suited for domain-specific EL. BabelNet [Navigli and Ponzetto 2012] is a multilingual encyclopedic dictionary with lexicographic coverage of terms and a semantic ontology that contains concepts and named entities in a large network of semantic relations. BabelNet also follows the WordNet synonym set (known as synset) and extends it to contain multilingual lexicalizations. Each BabelNet synset represents a given meaning and its synonyms that express that meaning in different languages. Recently, BabelNet has emerged as a wide-coverage and lexical-semantic knowledge resource that integrates heterogeneous resources into a unified framework to facilitate various NLP tasks and applications in a range of different languages [Navigli et al. 2021].
Prior to scheduling a doctor's appointment for diagnosis and treatment of perceived medical issues, people frequently seek healthcare-related information online from health-related news articles, digital health services, and medical question-answering forums. Domain-specific healthcare information usually includes many biomedical named entities, presenting language processing challenges in terms of sense disambiguation for healthcare-related applications. For example, in a sentence “遠離結節並維持身體健康, 平時要維持一個愉悅的心情。” (To prevent nodules and maintain good health, we should usually try to keep a happy mood.), the correct meaning of “結節” (nodules) in the context should be linked to BabelNet synset ID bn:02588649n with the meaning “在醫學上, 結節是小小而堅硬的腫塊, 通常直徑大於1厘米” (In medicine, nodules are small firm lumps, usually greater than 1 cm in diameter.), rather than the other two synset IDs: bn:24656639n “在人類骨骼和至少其他哺乳動物的骨骼中, 結節或骨突是用作骨骼肌附件的突起或隆起。” (In the skeleton of humans and other animals, a tubercle, tuberosity or apophysis is a protrusion or eminence that serves as an attachment for skeletal muscles) and bn:00057856n “植物上的小圓形疣狀突起” (Small rounded wart-like protuberance on a plant.)
In this study, we propose the DGE (Dual Gloss Encoders) model, comprising a context-aware gloss encoder and a lexical gloss encoder, to fully utilize gloss knowledge for biomedical EL. We also construct a language resource called the Chinese HealthWSD corpus, containing 10,218 sentences across 40 biomedical target entities, each of which has been manually annotated with linking senses in BabelNet 5.0, for Chinese WSD in the biomedical domain. Experimental results show that our DGE model significantly outperforms previous methods on our constructed dataset. Further empirical analysis confirms the effectiveness of the proposed model.
The rest of this article is organized as follows. Section 2 reviews studies on WSD models and related language resources. Section 3 provides a detailed description of our proposed DGE model. Section 4 describes experiments used for performance evaluation, including corpus construction, result comparisons, and a series of model analyses. Conclusions are finally drawn in Section 5.

2 Related Work

EL and WSD both address lexical ambiguity. While these two tasks are fairly similar, they differ in terms of lexical item selection [Moro et al. 2014]. EL is a specific type of WSD in which the target is a named entity instead of a general word form. The EL task is to link a named entity occurring in a specific context to its particular sense as defined in a semantic network. Naturally, it belongs to the WSD research area. Hence, we review WSD models and language resources as follows.

2.1 WSD Models

Generally speaking, there are two kinds of WSD approaches: knowledge-based systems and supervised models.
(1)
Knowledge-based systems. Knowledge-based WSD systems rely on lexical-semantic language resources such as WordNet [Miller 1995] and BabelNet [Navigli and Ponzetto 2012]. The Lesk algorithm was the first attempt to use a word meaning, known as a gloss, for WSD [Lesk 1986]. A semantic relatedness measure between concepts was introduced based on the number of overlapping words in their glosses [Banerjee and Pedersen 2003]. The Most Frequent Sense (MFS) baseline was used to assign the most common WordNet sense to each word for WSD [Navigli et al. 2007]. An unsupervised graph-based method was proposed to combine similarity metrics and graph centrality for WSD [Sinha and Mihalcea 2007]. A vector space model was used to combine linguistic features with MeSH (Medical Subject Headings) terms for biomedical WSD [Stevenson et al. 2008]. A graph-based WSD algorithm was introduced to experimentally study graph connectivity for unsupervised WSD [Navigli and Lapata 2010]. An enhanced Lesk WSD algorithm was proposed that uses a word similarity function defined on a distributional semantic space for gloss-context overlap computation [Basile et al. 2014]. Random walks were used to process excessively large lexical knowledge bases for knowledge-based WSD [Agirre et al. 2014]. Babelfy represents a unified graph-based approach to EL and WSD based on a loose identification of candidate meanings coupled with a subgraph heuristic that selects semantic interpretations [Moro et al. 2014]. Definitions from the UMLS (Unified Medical Language System) were combined with word information to build concept embedding representations, and cosine similarity was then used to select the nearest sense of the target word [Tulkens et al. 2016]. A combination of concept embeddings from UMLS and word embeddings was also used to address biomedical WSD with a traditional k-nearest-neighbor (KNN) algorithm [Sabbir et al. 2017]. A topic model was formulated to use the whole document as the context for a word to be disambiguated [Chaplot and Salakhutdinov 2018]. The semantic space and semantic path were respectively modeled by latent semantic analysis and PageRank for knowledge-based WSD [Wang et al. 2020]. Unsupervised methods were used to construct multilingual co-occurrence graphs for biomedical WSD [Duque et al. 2016].
(2)
Supervised models. Neural WSD models feed continuous word representations of a sentence into a neural network for model training and then build a classifier on top of learned features. The neural sequence tagging approach was explored for WSD by leveraging a bidirectional long short-term memory (LSTM) network [Kågebäck and Salomonsson 2016]. A self-attention layer on top of the concatenated bidirectional LSTM was proposed for WSD [Raganato et al. 2017b]. The glosses from WordNet and the semantic relationship between the concept and gloss were incorporated with a bidirectional LSTM network for WSD [Luo et al. 2018b]. A hierarchical co-attention mechanism was introduced to generate co-dependent representations for the context and gloss [Luo et al. 2018a]. The convolutional neural network (CNN) architecture was also used to achieve clinical abbreviation disambiguation [Joopudi et al. 2018]. Contextual features and word embeddings were compared in terms of their feature importance for supervised biomedical WSD [Yepes 2017]. Different strategies were explored to integrate pre-trained contextualized word representations to improve WSD performance [Hadiwinoto et al. 2019]. GlossBERT was proposed to construct context-gloss pairs from all possible senses of the target word in WordNet for WSD as a sentence-pair classification problem [Huang et al. 2019]. A jointly optimized bi-encoder model (BEM) was proposed to improve all-words English WSD [Blevins and Zettlemoyer 2020]. Enhanced WSD Integrating Synset Embeddings and Relations (EWISER) is a neural supervised architecture to incorporate knowledge graph information [Bevilacqua and Navigli 2020]. In addition to gloss definitions, additional information including synonyms, example phrases, and sense gloss of hypernyms were exploited to improve WSD performance [Song et al. 2021]. WSD was also reframed as a span extraction problem and a transformer-based neural architecture was proposed for the new formulation [Barba et al. 2021a]. Continuous Sense Comprehension (ConSeC) followed this task re-framing to regard WSD as a text extraction problem, introducing a feedback loop strategy that allows the disambiguation to be conditioned not only on the target word's context but also on the senses assigned to nearby words [Barba et al. 2021b].
In summary, neural-based supervised systems are now widely used for WSD; therefore, the present research uses pre-trained embeddings as representations and explores DGE based on the transformer architecture for Chinese WSD in the biomedical domain.

2.2 WSD Language Resources

For English WSD, performance is typically evaluated using a unified evaluation framework [Raganato et al. 2017a], which includes two training sets, SemCor [Miller et al. 1993] and OMSTI [Taghipour and Ng 2015], and five test sets from the Senseval/SemEval series [Palmer et al. 2001; Snyder and Palmer 2004; Pradhan et al. 2007; Navigli et al. 2013; Moro and Navigli 2015], all of which were standardized to the same format and the WordNet 3.0 sense inventory. SemCor [Miller et al. 1993] is the largest human-constructed WSD resource, containing 226,036 annotated examples covering 33,362 separate senses. Among the five test sets, the smallest, the SemEval-2007 task 17 dataset [Pradhan et al. 2007] containing 135 sentences, is usually used as the development set.
Relatively little research has focused on Chinese WSD and standard benchmark data is lacking. In Senseval-3 task 5, the entries contained 20 different Chinese words annotated with HowNet senses [Litkowski 2004]. In SemEval-07 task 5 [Jin et al. 2007], a total of 40 Chinese ambiguous words, including 19 nouns and 21 verbs, were selected for WSD system development and evaluation. Each sense of a target word was annotated using the Chinese Semantic Dictionary (CSD), with 2,686 training instances and 935 test instances. Chinese OntoNotes Release 5.0 (LDC2013T19) containing word sense annotations for Chinese was used to evaluate WSD across four genres, including broadcast conversation, broadcast news, magazines, and newswires [Hadiwinoto et al. 2019].
Only two previous studies have focused on biomedical WSD data construction. The NLM-WSD dataset [Weeber et al. 2001] contains 5,000 manually created instances covering 50 ambiguous UMLS Metathesaurus strings from 1998 MEDLINE citations. The more recent MSH-WSD test collection [Jimeno-Yepes et al. 2011] was automatically created to cover 37,090 instances of ambiguous terms from 2010 MEDLINE citations using the UMLS Metathesaurus and manual MeSH indexing of MEDLINE.
To the best of our knowledge, there are no publicly available datasets designed for biomedical WSD in the Chinese language. In this article, we construct a language resource called the Chinese HealthWSD corpus, containing 10,218 sentences across 40 target entities, each manually annotated with linking senses in BabelNet 5.0. This study represents the first effort to produce such a dataset for Chinese WSD in the biomedical domain, and the corpus will be released as a language resource for further research.

3 DGE

In Chinese WSD, a sentence is represented as a sequence of characters denoted as \({c}_1{c}_2 \cdots {c}_i \cdots {c}_j \cdots {c}_n\), in which a target word consists of one or more contiguous characters, say from \({c}_i\) to \({c}_j\) where \(j \ge i\). The word to be disambiguated has a set of candidate senses \(G:\{ {{g}_1,{g}_2,\ldots,{g}_k,\ldots,{g}_m} \}\) in a pre-defined sense inventory. Therefore, given a word w and its context in terms of a sentence s, a WSD system is a function \(f( {w,s} ) = g\) subject to \(g \in G\), which aims to find the most suitable gloss g for the target w in sentence s.
We propose a WSD model for Chinese biomedical EL. The overall architecture is shown in Figure 1. Our proposed DGE model consists of two independent encoders: (1) a context-aware gloss encoder, which represents the surrounding text of the target entity along with the definition text of each word sense; and (2) a lexical gloss encoder, which represents the definition text of each word sense. Both encoders are transformer networks initialized with BERT [Devlin et al. 2019] that embed each token of the surrounding text as the context and each definition text as a gloss. We then score each candidate gloss for a target entity by taking the dot product of the two encoder representations. The gloss with the highest score is regarded as the predicted word sense.
Fig. 1. Our proposed model architecture for Chinese biomedical entity linking.

3.1 Context-aware Gloss Encoder

Our encoder, based on the BERT architecture [Devlin et al. 2019], independently encodes target entities with their contexts and glosses [Blevins and Zettlemoyer 2020; Humeau et al. 2020]. The embedding representation of the context-aware gloss encoder consists of the following three parts:
(1)
Token Embedding. The sentence containing the target entity is denoted as the context sentence. We obtain all possible gloss sentences for each target entity from the sense inventory BabelNet 5.0. Each context-gloss pair is constructed using the BERT-specific [CLS] and [SEP] tokens. The input token sequence \({X}_k\) is defined as follows: \([ {CLS} ]{c}_1{c}_2 \cdots \# {c}_i \cdots {c}_j\# \cdots {c}_n[ {SEP} ]{g}_{k1}{g}_{k2} \cdots [ {SEP} ]\), where the weak supervision signal “#” is used to emphasize the target entity in the context sentence [Huang et al. 2019], and the kth gloss of the target entity is tokenized as \({g}_{k1}{g}_{k2} \cdots\). As in BERT, the first token is the [CLS] classification tag, and the context sentence and gloss sentence are separated by the [SEP] token.
(2)
Segment Embedding. Different from traditional BERT, which uses segment embeddings to distinguish sentences, we use the same segment tag “1” for all tokens, since the context sentence contains the target entity and each gloss sentence is the definition text of that entity. Naturally, all tokens implicitly describe the same target entity.
(3)
Position Embedding. We use mask-self-attention [Liu et al. 2020] to reflect the soft-position information of each input context-gloss sentence pair. We mask the position information of all tokens with the tag “-1”, except for the tokens belonging to the entity to be disambiguated, to emphasize the target in the context sentence.
The encoder layers then produce a sequence of embedding representations. We define the representation of the target word, \({r}_{w,k}\), as shown in Equation (1), where the target word w is tokenized into multiple characters spanning the ith to jth tokens of the context-gloss pair \({X}_k\), and \({E}_C\) denotes the average of the corresponding values over all encoder layers.
\begin{equation} {r}_{w,k} = \frac{1}{j - i + 1}\sum_{t = i}^{j} {E}_C\left( {X}_k \right)\left[ t \right]. \end{equation}
(1)
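To make the construction of \({X}_k\) and Equation (1) concrete, the following is a minimal sketch of the context-aware gloss encoder, assuming the HuggingFace transformers library and the hfl/chinese-bert-wwm-ext checkpoint; the function names (build_context_gloss_pair, target_representation) are illustrative rather than the authors' code, and the soft-position masking of item (3) is omitted for brevity.

import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-bert-wwm-ext")
encoder = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext", output_hidden_states=True)

def build_context_gloss_pair(context: str, start: int, end: int) -> str:
    # Insert the weak supervision marker '#' around the target entity span,
    # given as character offsets [start, end) in the context sentence.
    return context[:start] + "#" + context[start:end] + "#" + context[end:]

def target_representation(context: str, start: int, end: int, gloss: str) -> torch.Tensor:
    marked = build_context_gloss_pair(context, start, end)
    enc = tokenizer(marked, gloss, return_tensors="pt")
    # The paper uses a single segment tag "1" for every token, so overwrite
    # the default 0/1 segment ids produced by the tokenizer.
    enc["token_type_ids"] = torch.ones_like(enc["token_type_ids"])
    out = encoder(**enc)
    # Equation (1) averages over all encoder layers, not just the last one.
    layers = torch.stack(out.hidden_states[1:], dim=0).mean(dim=0)   # (1, seq_len, dim)
    # Collect the token indices covering the target entity (+1 skips the leading '#').
    span_ids = sorted({t for ci in range(start + 1, end + 1)
                       if (t := enc.char_to_token(0, ci, sequence_index=0)) is not None})
    # Mean over the target-entity tokens gives r_{w,k}.
    return layers[0, span_ids].mean(dim=0)                           # (dim,)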

3.2 Lexical Gloss Encoder

The standard BERT architecture is used to train the lexical gloss encoder. The kth lexical gloss of the target entity, \({g}_k\), is tokenized as a sequence \({g}_{k1}{g}_{k2} \cdots {g}_{kn}\), where n is the number of tokens in the gloss. We define the lexical gloss encoder as \({T}_g\) and take the representation of the kth lexical gloss \({r}_k\) as follows:
\begin{equation} {r}_k = {T}_g\left( {{g}_k} \right)\left[ 0 \right], \end{equation}
(2)
where we take the final output representation corresponding to the first input [CLS] token (index 0 in the token list) from the lexical gloss encoder as a global representation of \({g}_k\).
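A minimal sketch of the lexical gloss encoder in Equation (2), again assuming the HuggingFace transformers library and the same illustrative checkpoint: the gloss text is encoded on its own, and the output vector at index 0 (the [CLS] position) is taken as the global gloss representation \({r}_k\).

import torch
from transformers import BertTokenizerFast, BertModel

gloss_tokenizer = BertTokenizerFast.from_pretrained("hfl/chinese-bert-wwm-ext")
gloss_encoder = BertModel.from_pretrained("hfl/chinese-bert-wwm-ext")

def gloss_representation(gloss: str) -> torch.Tensor:
    enc = gloss_tokenizer(gloss, return_tensors="pt")
    out = gloss_encoder(**enc)
    # r_k = T_g(g_k)[0]: the final-layer vector at the [CLS] position.
    return out.last_hidden_state[0, 0]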

3.3 Scoring

We then score each candidate sense \({g}_k \in G\) for a target entity w by taking the dot product of \({r}_{w,k}\) with the corresponding \({r}_k\) as follows:
\begin{equation} \phi \left( {w,{g}_k} \right) = {r}_{w,k} \cdot {r}_k. \end{equation}
(3)
We use a cross-entropy loss over the scores of the candidate senses of the target word to train our proposed DGE model. The loss for a word w paired with its correct sense \({g}_k\) is as follows:
\begin{equation} L\left( {w,{g}_k} \right) = - \phi \left( {w,{g}_k} \right) + \log \sum_{k' = 1}^{m} \exp \left( \phi \left( {w,{g}_{k'}} \right) \right). \end{equation}
(4)
During the inference phase, we predict the sense \(\hat{g}\) of the target word w to be the sense \({g}_k \in G\) whose representation \({r}_k\) has the highest dot product score with \({r}_{w,k}\).
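Putting the two encoders together, the following hedged sketch illustrates the scoring of Equation (3), the loss of Equation (4), and argmax inference; target_representation and gloss_representation refer to the illustrative functions sketched in Sections 3.1 and 3.2, and in an actual implementation both encoders would be optimized jointly on these scores.

import torch
import torch.nn.functional as F

def disambiguate(context: str, start: int, end: int, glosses: list[str]):
    # Score every candidate sense by the dot product of Equation (3)
    # and return the scores together with the argmax prediction.
    scores = torch.stack([
        torch.dot(target_representation(context, start, end, g),   # r_{w,k}
                  gloss_representation(g))                          # r_k
        for g in glosses
    ])
    return scores, int(scores.argmax())

def dge_loss(scores: torch.Tensor, gold_index: int) -> torch.Tensor:
    # Equation (4): -phi(w, g_gold) + log sum_k' exp(phi(w, g_k')),
    # i.e., cross-entropy over the candidate senses of the target entity.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_index]))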
Table 1 shows an example with a target entity to be disambiguated by our DGE model. Consider the sentence “腺瘤性息肉癌變機率較高, 建議即時切除。” (Adenomatous polyps have a higher probability of becoming cancerous; prompt excision is recommended.) with the target entity “息肉” (polyp) to be disambiguated. There are two gloss definitions of polyp in BabelNet 5.0. During DGE model inference, the first gloss “息肉是突出於黏膜的組織異常生長, 一般定義是長於中空器官的腔壁黏膜上。” (A small vascular growth on the surface of a mucous membrane.), which receives the highest score, is clearly returned as the system output for biomedical EL in this context sentence, rather than the other gloss “共同服用的兩種形式中的一種 (例如, Hydra 或珊瑚): 通常用空心圓柱體的沉重, 通常在嘴周圍有一個觸手環。” (One of two forms taken by coelenterates (e.g., a hydra or coral): usually sedentary with a hollow cylindrical body usually with a ring of tentacles around the mouth.).
Sentence: 腺瘤性息肉癌變機率較高, 建議即時切除。(Adenomatous polyps have a higher probability of becoming cancerous and prompt excision is recommended.)
Target Entity: 息肉 (polyp)
Context-Aware Gloss Encoder
X1: [CLS] 腺瘤性#息肉#癌變機率較高…… [SEP] 息肉是突出於黏膜的組織… [SEP]
X2: [CLS] 腺瘤性#息肉#癌變機率較高…… [SEP] 共同服用的兩種形式中的… [SEP]
Lexical Gloss Encoder
g1: [CLS] 息肉是突出於黏膜的組織異常生長… [SEP]
(A small vascular growth on the surface of a mucous membrane.)
g2: [CLS] 共同服用的兩種形式中的一種 (例如, Hydra或珊瑚) … [SEP]
(One of two forms that coelenterates take (e.g. a hydra or coral) …)
Dot product Score
\({X}_1 \cdot {g}_1\) 58.60 (the highest score: a correct sense g1)
\({X}_1 \cdot {g}_2\) 26.08
\({X}_2 \cdot {g}_1\) 57.87 (the second-highest score; wrong context-gloss pair, but correct gloss)
\({X}_2 \cdot {g}_2\) 24.67
Table 1. An Illustrated Example for DGE Model

4 Experiments for Performance Evaluation

4.1 Datasets

Due to the lack of publicly available datasets, we built a corpus for WSD in the biomedical domain. We first selected named entities in the Chinese HealthNER corpus [Lee and Lu 2021], which covers 10 entity types (body, symptom, instrument, examinations, chemical, disease, drug, supplement, treatment, and time), as seed entities to search BabelNet 5.0, a multilingual encyclopedic lexicon that contains named entities in a large network of semantic relations. A total of 735 distinct named entities have at least two semantically different glosses in BabelNet. After manual checking, we selected 40 nouns as target biomedical entities, excluding those with unclear or overly specific glosses (e.g., names of creative works or authors).
We used these 40 distinct words as seed query terms to search for and retrieve sentences containing the target named entities. To efficiently and accurately crawl sentences containing ambiguous target words with different glosses, we manually paired gloss-related keywords with the target words as queries, which were automatically submitted to the Google search engine to obtain appropriate search results. We then crawled the corresponding web pages, removed all HTML tags, images, videos, and embedded web advertisements, split the remaining texts into sentences, and retained sentences containing at least one target entity word for manual annotation.
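As an illustration only (not the authors' actual pipeline), the following sketch shows the cleaning and filtering step described above: stripping HTML, splitting the remaining text into sentences on Chinese sentence-final punctuation, and retaining sentences that mention a target entity. BeautifulSoup is an assumed dependency, and the entity set shown is a small subset of the 40 seed entities.

import re
from bs4 import BeautifulSoup

TARGET_ENTITIES = {"息肉", "結節", "卵巢"}   # illustrative subset of the 40 target entities

def candidate_sentences(html: str) -> list[str]:
    # Drop scripts, styles, and embedded media, then keep the visible text.
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "img", "video", "iframe"]):
        tag.decompose()
    text = soup.get_text(separator="")
    # Split after Chinese full stops, question marks, and exclamation marks.
    sentences = [s.strip() for s in re.split(r"(?<=[。！？])", text) if s.strip()]
    # Retain only sentences containing at least one target entity word.
    return [s for s in sentences if any(entity in s for entity in TARGET_ENTITIES)]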
Three graduate students from an electrical engineering and biomedical research group were trained in word sense tagging. Each sentence was annotated by all three annotators, who selected an appropriate gloss for the target entity. The annotators were asked to discuss differences and seek consensus. If no consensus could be reached after discussion, those sentences were excluded from our constructed data.
Through this process, we created the Chinese HealthWSD corpus, which can be used for EL research in the biomedical domain. Table 2 shows descriptive statistics. Most target entities have two glosses (29 target words, or 72.5% of the total), followed by three glosses (seven words, 17.5%) and four glosses (four words, 10%). In total, we have 10,218 sentences containing biomedical named entities with annotated glosses in BabelNet 5.0. Each sentence contains an average of 52.49 tokens (characters and punctuation marks). Compared with the previous Chinese WSD data in SemEval-2007 task 5 [Jin et al. 2007], which contained 3,621 sentences across 40 target words, our constructed Chinese HealthWSD corpus has 2.8 times as many sentences across 40 distinct biomedical entities and is well suited for Chinese WSD research.
2 glosses (29 entities, 58 glosses, 7,409 sentences, 393,726 tokens): 前臂 (forearm)、皮毛 (coat)、*部位 (component)、黏液 (mucus)、心臟病 (heart disease)、乳管 (lactiferous duct)、薄膜 (biological membrane)、鼓膜 (eardrum)、胚囊 (gestational sac)、分泌物 (exudate)、組織 (tissue)、手足 (hands and feet)、多巴胺 (dopamine)、卵巢 (ovary)、超音波 (ultrasound)、顯影劑 (contrast medium)、息肉 (polyp)、炭疽病 (anthrax)、白斑 (vitiligo)、*緊身衣 (skin-tight garment)、*檢測 (assay)、*鏡頭 (camera lens)、呼吸管 (snorkel)、牙套 (gumshield)、神經衰弱 (neurasthenia)、閉鎖 (atresia)、過敏反應 (allergy)、脂肪 (adipose tissue)、石膏 (gypsum)
3 glosses (7 entities, 21 glosses, 1,780 sentences, 92,676 tokens): 眨眼 (blink)、結節 (nodule)、黑眼圈 (black eye)、乳房 (udder)、*香料 (spice)、穿刺 (centesis)、*手套 (glove)
4 glosses (4 entities, 16 glosses, 1,029 sentences, 49,931 tokens): 隔膜 (diaphragm)、眼睛 (eye)、導管 (catheter)、眼罩 (blindfold)
Total: 40 distinct entities, 95 glosses, 10,218 sentences, 536,333 tokens
Table 2. Statistics of Chinese HealthWSD Corpus
“*” denotes that the target word does not have a biomedical gloss in BabelNet 5.0.

4.2 Settings

The hold-out validation approach was used to tune the hyper-parameters and evaluate model performance. Table 3 presents descriptive statistics for the mutually exclusive datasets: 70% of our constructed corpus was randomly selected for model training, 10% was used for hyper-parameter tuning, and the remaining 20% was used for performance evaluation. Based on suggestions from related studies [Blevins and Zettlemoyer 2020], the hyper-parameter values for our DGE model implementation were set as follows: number of training epochs 20, batch size 32, learning rate 0.00001, and maximum sequence length 256. In addition, Chinese-bert-wwm-ext [Cui et al. 2021] was used for the pre-trained embedding vectors.
Datasets      #Sent     #Token
Training      7,109     373,152
Validation    979       50,320
Test          2,130     112,861
All           10,218    536,333
Table 3. Experimental Dataset Statistics
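A minimal sketch of the hold-out protocol and the hyper-parameter values reported above; the random seed and variable names are assumptions made for illustration.

import random

def split_corpus(sentences, seed=42):
    # Random 70% / 10% / 20% split into training, validation, and test sets.
    random.seed(seed)
    shuffled = sentences[:]
    random.shuffle(shuffled)
    n = len(shuffled)
    return (shuffled[:int(0.7 * n)],              # training
            shuffled[int(0.7 * n):int(0.8 * n)],  # validation
            shuffled[int(0.8 * n):])              # test

HYPERPARAMS = {
    "epochs": 20,
    "batch_size": 32,
    "learning_rate": 1e-5,          # 0.00001
    "max_seq_length": 256,
    "pretrained_model": "hfl/chinese-bert-wwm-ext",   # Chinese-bert-wwm-ext [Cui et al. 2021]
}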
For evaluation, we used the standard WSD evaluation metrics of precision, recall, and F1-score. Precision is the ratio of biomedical entities assigned correct glosses by the WSD system to the total number of entities the system identified. Recall is the ratio of biomedical entities assigned correct glosses by the WSD system to all biomedical entities in the sentences. The F1-score is the harmonic mean of precision and recall. We used the entity-linking evaluation program released with SemEval-2015 task 13 [Moro and Navigli 2015] to measure system performance. Under this metric definition, precision and recall are identical for WSD, so previous WSD studies report only the F1-score.
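Because the system assigns a gloss to every target entity, the number of attempted instances equals the total number of instances, which is why precision and recall coincide; a small sketch of the metric computation under that assumption:

def wsd_scores(num_correct: int, num_attempted: int, num_total: int):
    # Standard WSD evaluation metrics: precision, recall, and their harmonic mean.
    precision = num_correct / num_attempted
    recall = num_correct / num_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1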

4.3 Results

The following three models were compared to demonstrate their performance for Chinese biomedical EL.
(1)
GlossBERT [Huang et al. 2019]. GlossBERT converted the WSD task into a sentence-pair classification task. Sentence-gloss pairs, constructed with a weak supervision mechanism to emphasize the target word in the sentence, were fed into the BERT model for gloss classification.
(2)
BERT-wsd [Hadiwinoto et al. 2019]. BERT-wsd proposed different strategies for integrating pre-trained contextualized word representations, including linear projection of hidden layers and nearest neighbor matching, to improve BERT performance on the WSD task.
(3)
BEM [Blevins and Zettlemoyer 2020]. BEM used a bi-encoder model to separately consider the embeddings of the target word in the sentence and the corresponding glosses to be disambiguated. The encoders are jointly optimized to assign the nearest gloss embedding for each target word.
Table 4 compares the results of various EL models on our constructed Chinese HealthWSD dataset. The F1-score obtained by our proposed DGE model showed significant differences from all other models (p-value < 0.05). GlossBERT [Huang et al. 2019] took the pairing of a sentence and a gloss of the target word as its input and considered the embedding of the whole pair to predict which glosses were correct. BERT-wsd [Hadiwinoto et al. 2019] identified the correct gloss based on the target word embedding only. Both models confirmed the effectiveness of the BERT model and its embeddings. BEM [Blevins and Zettlemoyer 2020] considered not only the embedding of the target word but also the glosses of the target word in a bi-encoder architecture, achieving the second-best F1-score of 96.78. This obvious performance enhancement reflects the usefulness of the bi-encoder architecture.
Model                                  F1-score (%)
GlossBERT [Huang et al. 2019]          81.36
BERT-wsd [Hadiwinoto et al. 2019]      81.56
BEM [Blevins and Zettlemoyer 2020]     96.78
DGE (ours)                             97.81
Table 4. Experimental Results on Chinese HealthWSD Datasets
Our proposed DGE method used gloss knowledge and a weak supervision mechanism to direct the attention mechanism toward the target word, providing the best F1-score of 97.81 and slightly outperforming the BEM model with a statistically significant difference (p-value = 0.0339). In summary, our DGE model is an effective solution for the Chinese EL task in the biomedical domain.

4.4 In-depth Analysis

We further discuss the findings of the proposed DGE model in the following aspects.
(1)
Ablation study.
We conducted an ablation study of the context-aware gloss encoder in our proposed DGE model by removing the weakly supervised mechanism, selecting a different output embedding vector (the target entity versus the [CLS] token), and using different representation values (the average of all encoder layers versus the last layer only).
Table 5 shows the performance comparison. We first removed the weakly supervised mechanism (marked as “– weak”). This may cause the attention mechanism to lose focus on the target entity, thus degrading performance. We further replaced the output embedding of the target entity with that of the special classification [CLS] token (marked as “– weak, target → [CLS]”). The context-aware gloss encoder suffered a larger performance loss, indicating that the embedding vector of the target entity is an essential clue that provides a more specific representation than the [CLS] token, which usually represents the meaning of the whole sentence. We also replaced the embedding values averaged over all encoder layers with those of the last layer only (marked as “– weak, average → last”). This significantly reduced performance, indicating that, despite the widespread use of the last layer as the embedding vector, averaging all layers achieves better results.
DGE Model                                                  F1-score
Context-aware Gloss Encoder (weak + target + average)      97.81
– weak                                                     97.34
– weak, target → [CLS]                                     93.13
– weak, average → last                                     96.89
Table 5. Ablation Study for Context-Aware Encoder
(2)
Embedding effects.
To understand the effect of different embedding methods on the representations of the target entity and its corresponding glosses, we compared BERT [Cui et al. 2021] with RoBERTa [Liu et al. 2019] and MacBERT [Cui et al. 2020], with results summarized in Table 6. The BERT embedding achieved the best result, although the differences are small because all three models share a similar BERT-like architecture for embedding representation.
Word Embeddings                                F1-score
Chinese-bert-wwm-ext [Cui et al. 2021]         97.81
Chinese-roberta-wwm-ext [Liu et al. 2019]      97.69
Chinese-macbert-base [Cui et al. 2020]         97.58
Table 6. Performance Comparison of Different Embedding Methods
(3)
Domain-specific gloss effects.
Among the 40 named entities in our constructed Chinese HealthWSD corpus, six target entities have no biomedical gloss in BabelNet 5.0. We further verified the effect of the presence or absence of domain-specific glosses, with results shown in Table 7. Entities with biomedical glosses achieved a slightly better F1-score, because a relatively larger proportion of domain-specific words in a context sentence benefits domain-specific gloss selection.
Biomedical Gloss    Number of Target Entities    F1-score
Yes                 34                           97.87
No                  6                            97.43
Table 7. Performance Comparison of Domain-Specific Gloss Effects
(4)
Gloss number effects.
We analyzed the biomedical entities with different numbers of glosses. Table 8 shows the performance comparison. Entities with two glosses were the easiest to disambiguate, achieving the best F1-score of 98.38. However, entities with four glosses outperformed those with three glosses, indicating that the number of glosses is not directly related to WSD system performance. Some biomedical entities with three glosses, such as “黑眼圈” (black eye or dark circles) and “眨眼” (blink or jiffy), have candidate senses with highly similar semantics, making it difficult to identify the correct gloss even with manual annotation.
Gloss Number    Number of Target Words    F1-score
2 glosses       29                        98.38
3 glosses       7                         95.73
4 glosses       4                         97.30
Table 8. Performance Comparison of Gloss Number Effects

4.5 Error Analysis

Although our DGE model achieved promising results and outperformed previous work, we conducted additional analysis to identify the root causes of its errors. A total of 47 error cases were divided into three error types as follows.
(1)
Insufficient context (13 error cases, accounting for 27.66%).
A sentence may not provide sufficient context for WSD. For example, in the sentence “此數值越高, 表示卵巢功能越差。” (The higher the value is, the worse the ovaries are functioning.), the correct meaning of “卵巢” (ovaries) should be BabelNet ID bn:00059826n “卵巢在解剖學中是指動物雌性體內製造卵子的一對性腺體。” (One of usually two organs that produce ova and secrete estrogen and progesterone.), instead of bn:00059825n “子房是被子植物生長種子的器官, 位置不定, 依和雌蕊的相對位置可分為上位花、下位花或是周位花。” (The organ for growing seeds in angiosperms; a hollow structure located below the style of the female floral organ and generally slightly swollen.). Clearly, the sentence did not provide enough context, making it relatively difficult for the WSD system to determine the correct gloss.
(2)
Highly complicated and implicit semantics (17 cases/36.17%).
Some sentences contain implicit semantics, such as the sentence “間不容瞬是指眨眼的時間都沒有。” (“間不容瞬” (jiffy) means there is not even time to blink.), in which the correct meaning of “眨眼” (blink) should be linked to BabelNet ID bn:00011275n with the meaning “眨眼是眼瞼開合的動作。” (A reflex that closes and opens the eyes rapidly.), rather than bn:00011276n “很短的時間 (如眨眼或心跳所需的時間) 。” (A very short time (as the time it takes the eye to blink or the heart to beat).). The human blink time in this sentence is used to express a very short time, so “眨眼” (blink) here should refer to the reflex of the eyes closing and opening rapidly.
(3)
Unknown errors (17 cases/36.17%).
Some error cases cannot be categorized as either of the above two error types. For example, for “來自愛爾蘭的海藻精華富含天然的多種營養成分, 可增進生長因子IGF-1, 有助皮毛生長、減少掉毛……。” (Seaweed extract from Ireland is rich in natural multi-nutrients and boosts the growth factor IGF-1, which can help the coat grow and reduce shedding…), the correct gloss should be BabelNet ID bn:00020186n “覆蓋動物身體的頭髮、羊毛或毛皮的生長。” (Growth of hair or wool or fur covering the body of an animal), instead of BabelNet ID bn:00072299n “對一個主題的輕微或膚淺的了解。” (A slight or superficial understanding of a subject.) This kind of sentence contains enough context and doesn't present implicit semantics, but it's difficult to know exactly what caused the wrong gloss selection.

4.6 Discussion

A bi-encoder architecture called BEM [Blevins and Zettlemoyer 2020] has been used to independently exploit the context sentence and gloss text for optimized performance in the same representation space, achieving promising results for English WSD. Following the success of this work, we designed a bi-encoder method named DGE to enhance the performance for Chinese WSD. In experiments, BEM showed clear performance improvements compared with single encoder methods (i.e., GlossBERT [Huang et al. 2019] and BERT-wsd [Hadiwinoto et al. 2019]) and our DGE model provides additional performance enhancement, confirming the effectiveness of the bi-encoder architecture.
Our DGE method consists of a context-aware gloss encoder and a lexical gloss encoder. The lexical gloss encoder is designed to represent the gloss itself without additional texts, so this encoder cannot be used alone to perform the WSD task. The context-aware gloss encoder contains a context sentence and the target entity to be disambiguated. Using the context-aware gloss encoder alone is very close to GlossBERT [Huang et al. 2019], which constructs context-gloss pairs from glosses of all possible senses of the target word to treat the WSD task as a sentence-pair classification.
The input to both encoders in our proposed model is a sequence of characters without domain-specific token injection. Therefore, our DGE model can be directly applied to open-domain texts without additional limitations. In addition, our DGE model is language-independent and may be used for any text written left-to-right.

5 Conclusions

We propose a DGE model for Chinese biomedical EL, making the following contributions:
(1)
We present a dual encoder architecture based on gloss knowledge for target entity disambiguation.
The encoders separately model the gloss knowledge: a context-aware gloss encoder with a weakly supervised mechanism and a lexical gloss encoder containing sense texts for contextualized embedding representations. The dual encoders are then jointly optimized to assign the nearest gloss (the one with the highest score) for WSD. Our DGE model achieved an F1-score of 97.81, significantly outperforming existing models for Chinese biomedical EL.
(2)
We construct a Chinese WSD dataset for word sense disambiguation and EL in the biomedical domain.
To the best of our knowledge, our constructed dataset is the first Chinese biomedical WSD corpus. It includes a total of 10,218 sentences with around 0.53 million characters. After manual annotation, we have 95 glosses found in BabelNet 5.0 across 40 distinct biomedical entities. We plan to release our constructed Chinese HealthWSD corpus as a language resource for further research.

References

[1]
E. Agirre, O. López de Lacalle, and A. Soroa. 2014. Random walks for knowledge-based word sense disambiguation. Computational Linguistics 40, 1 (2014), 57–84.
[2]
S. Banerjee and T. Pedersen. 2003. Extended gloss overlaps as a measure of semantic relatedness. In Proceedings of the 18th International Joint Conference on Artificial Intelligence. 805–810.
[3]
E. Barba, T. Pasini, and R. Navigli. 2021a. ESC: Redesigning WSD with extractive sense comprehension. In Proceedings of the 2021 Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4661–4672.
[4]
E. Barba, L. Procopio, and R. Navigli. 2021b. ConSec: Word sense disambiguation as continuous sense comprehension. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 1492–1503.
[5]
P. Basile, A. Caputo, and G. Semeraro. 2014. An enhanced Lesk word sense disambiguation algorithm through a distributional semantic model. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers. 1591–1600.
[6]
M. Bevilacqua and R. Navigli. 2020. Breaking through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorporating knowledge graph information. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2854–2864.
[7]
T. Blevins and L. Zettlemoyer. 2020. Moving down the long tail of word sense disambiguation with gloss informed bi-encoders. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 1006–1017.
[8]
D. S. Chaplot and R. Salakhutdinov. 2018. Knowledge-based word sense disambiguation using topic models. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. 5062–5069.
[9]
Y. Cui, W. Che, T. Liu, B. Qin, S. Wang, and G. Hu. 2020. Revisiting pre-trained models for Chinese natural language processing. Findings of the Association for Computational Linguistics: EMNLP. 657–668.
[10]
Y. Cui, W. Che, T. Liu, B. Qin, and Z. Yang. 2021. Pre-training with whole word masking for Chinese BERT. IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021), 3504–3514.
[11]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 4171–4186.
[12]
A. Duque, J. Martinez-Romo, and L. Araujo. 2016. Can multilinguality improve biomedical word sense disambiguation? Journal of Biomedical Informatics 64 (2016), 320–332.
[13]
C. Hadiwinoto, H. T. Ng, and W. C. Gan. 2019. Improved word sense disambiguation using pre-trained contextualized word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 5297–5306.
[14]
L. Huang, C. Sun, X. Qiu, and X. Huang. 2019. GlossBERT: BERT for word sense disambiguation with gloss knowledge. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. 3509–3514.
[15]
S. Humeau, K. Shuster, M.-A. Lachaux, and J. Weston. 2020. Poly-encoders: Transformer architectures and pre-training strategies for fast and accurate multi-sentence scoring. In Proceedings of the 8th International Conference on Learning Representations. 1–14.
[16]
A. J. Jimeno-Yepes, B. T. McLnnes, and A. R. Aronson. 2011. Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation. BMC Bioinformatics 12, 223 (2011), 1–14.
[17]
P. Jin, Y. Wu, and S. Yu. 2007. SemEval-2007 task 5: Multilingual Chinese-English lexical sample. In Proceedings of the 4th International Workshop on Semantic Evaluations. 19–23.
[18]
V. Joopudi, B. Dandala, and M. Devarakonda. 2018. A convolutional route to abbreviation disambiguation in clinical text. Journal of Biomedical Informatics 86 (2018), 71–78.
[19]
M. Kågebäck and H. Salomonsson. 2016. Word sense disambiguation using a bidirectional LSTM. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon. 51–56.
[20]
L. H. Lee and Y. Lu. 2021. Multiple embeddings enhanced multi-graph neural networks for Chinese healthcare named entity recognition. IEEE Journal of Biomedical and Health Informatics 25, 7 (2021), 2801–2810.
[21]
M. Lesk. 1986. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation. 24–26.
[22]
K. Litkowski. 2004. Senseval-3 task: word sense disambiguation of WordNet glosses. In Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. 13–16.
[23]
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
[24]
W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang. 2020. K-BERT: Enabling language representation with knowledge graph. In Proceedings of the 34th AAAI Conference on Artificial Intelligence. 2901–2908.
[25]
F. Luo, T. Liu, Z. He, Q. Xia, Z. Sui, and B. Chang. 2018a. Leveraging gloss knowledge in neural word sense disambiguation by hierarchical co-attention. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1402–1411.
[26]
F. Luo, T. Liu, Q. Xia, B. Chang, and Z. Sui. 2018b. Incorporating glosses into neural word sense disambiguation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 2473–2482
[27]
G. A. Miller. 1995. WordNet: A lexical database for English. Communications of the ACM 38, 11 (1995), 39–41.
[28]
G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993. Using a semantic concordance for sense identification. In Proceedings of the Human Language Technology Workshop. 21–24.
[29]
A. Moro and R. Navigli. 2015. Semeval-2015 task 13: Multilingual all-words sense disambiguation and entity linking. In Proceedings of the 9th International Workshop on Semantic Evaluation. 288–297.
[30]
A. Moro, A. Raganato, and R. Navigli. 2014. Entity linking meets word sense disambiguation: A unified approach. Transactions of the Association for Computational Linguistics 2 (2014), 231–244.
[31]
R. Navigli, M. Bevilacqua, S. Conia, D. Montagnini, and F. Cecconi. 2021. Ten years of BabelNet: a survey. In Proceedings of the 30th International Joint Conference on Artificial Intelligence. 4559–4567.
[32]
R. Navigli, D. Jurgens, and D. Vannella. 2013. Semeval-2013 task 12: Multilingual word sense disambiguation. In Proceedings of the 7th International Workshop on Semantic Evaluation. 222–231.
[33]
R. Navigli and M. Lapata. 2010. An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, 4 (2010), 678–692.
[34]
R. Navigli, K. C. Litkowski, and O. Hargraves. 2007. SemEval-2007 task 07: Coarse-grained English all-words task. In Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval ’07). 30–35.
[35]
R. Navigli and S. Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193 (2012), 217–250.
[36]
M. Palmer, C. Fellbaum, S. Cotton, L. Delfs, and H. T. Dang. 2001. English tasks: all-words and verb lexical sample. In Proceedings of the 2nd International Workshop on Evaluating Word Sense Disambiguation Systems. 21–24.
[37]
S. Pradhan, E. Loper, D. Dligach, and M. Palmer. 2007. SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of the 4th International Workshop on Semantic Evaluations. 87–92.
[38]
A. Raganato, C. D. Bovi, and R. Navigli. 2017b. Neural sequence learning models for word sense disambiguation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 1156–1167.
[39]
A. Raganato, J. Camacho-Collados, and R. Navigli. 2017a. Word sense disambiguation: A unified evaluation framework and empirical comparison. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics. 99–110.
[40]
A. Sabbir, A. Jimeno-Yepes, and R. Kavuluru. 2017. Knowledge-based biomedical word sense disambiguation with neural concept embeddings. In Proceedings of the 17th IEEE International Conference on Bioinformatics Bioengineering. 163–170.
[41]
R. Sinha and R. Mihalcea. 2007. Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. In Proceedings of the IEEE International Conference on Semantic Computing. 363–369.
[42]
B. Snyder and M. Palmer. 2004. The English all-word task. In Proceedings of the 3rd International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. 41–43.
[43]
Y. Song, X. C. Ong, H. T. Ng, and Q. Lin. 2021. Improved word sense disambiguation with enhanced sense representations. Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics, 4311–4320.
[44]
M. Stevenson, Y. Guo, R. Gaizauskas, and D. Martinez. 2008. Knowledge sources for word sense disambiguation of biomedical text. In Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing. 80–87.
[45]
K. Taghipour and H. T. Ng. 2015. One million sense-tagged instances for word sense disambiguation and induction. In Proceedings of the 19th Conference on Computational Natural Language Processing. 338–344.
[46]
S. Tulkens, S. Suster, and W. Daelemans. 2016. Using distributed representations to disambiguate biomedical and clinical concepts. In Proceedings of the 15th Workshop on Biomedical Natural Language Processing. 77–82.
[47]
Y. Wang, M. Wang, and H. Fujita. 2020. Word sense disambiguation: a comprehensive knowledge exploitation framework. Knowledge-Based Systems 190 (2020), 105030.
[48]
M. Weeber, J. G. Mork, and A. R. Aronson. 2001. Developing a test collection for biomedical word sense disambiguation. In Proceedings of the AMIA 2001 Annual Symposium. 746–750.
[49]
A. J. Yepes. 2017. Word embeddings and recurrent neural networks based on Long-Short Term Memory nodes in supervised biomedical word sense disambiguation. Journal of Biomedical Informatics 73 (2017), 137–147.
