Article

A Context-Preserving Tokenization Mismatch Resolution Method for Korean Word Sense Disambiguation Based on the Sejong Corpus and BERT

Hanjo Jeong
Department of Software Convergence Engineering, Mokpo National University, Muan 58554, Republic of Korea
Mathematics 2025, 13(5), 864; https://doi.org/10.3390/math13050864
Submission received: 24 January 2025 / Revised: 3 March 2025 / Accepted: 3 March 2025 / Published: 5 March 2025
(This article belongs to the Section E1: Mathematics and Computer Science)

Abstract

Word Sense Disambiguation (WSD) plays a crucial role in various natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and information retrieval. Due to the complex morphological structure and polysemy of the Korean language, the meaning of words can change depending on the context, making the WSD problem challenging. Since a single word can have multiple meanings, accurately distinguishing between them is essential for improving the performance of NLP models. Recently, large-scale pre-trained models like BERT and GPT, based on transfer learning, have shown promising results in addressing this issue. However, for languages with complex morphological structures, like Korean, the tokenization mismatch between pre-trained models and fine-tuning data prevents the rich contextual and lexical information learned by the pre-trained models from being fully utilized in downstream tasks. This paper proposes a novel method to address the tokenization mismatch issue during the fine-tuning of Korean WSD, leveraging BERT-based pre-trained models and the Sejong corpus, which has been annotated by language experts. Experimental results using various BERT-based pre-trained models and datasets from the Sejong corpus demonstrate that the proposed method improves performance by approximately 3–5% compared to existing approaches.

1. Introduction

Word Sense Disambiguation (WSD) plays a crucial role in various natural language processing (NLP) tasks, such as machine translation, sentiment analysis, and information retrieval. Korean, in particular, poses a challenge for WSD due to its complex morphological structure and polysemy, where a word’s meaning can vary depending on the context in which it is used [1,2]. Since a single word can have multiple meanings, accurately disambiguating these meanings is essential for enhancing the performance of NLP models.
Similarly to traditional machine learning algorithms, which are broadly categorized into supervised and unsupervised learning, machine learning-based WSD methods can also be classified into these two approaches. Supervised learning methods train classifiers using manually labeled corpora that associate words with their correct senses. Notable methods include Support Vector Machines (SVMs) [3], Naïve Bayes [4,5], and Artificial Neural Network (ANN)-based approaches. In contrast, unsupervised learning methods determine word senses by clustering word occurrence patterns within sentences using unlabeled corpora. Representative approaches include clustering based on knowledge sources such as WordNet and thesauri, which utilize gloss words and synonym words to infer word senses [6,7,8]. Another approach clusters contextual lexical information extracted from large-scale corpora like Wikipedia [9]. Additionally, in some cases, a hybrid approach that combines clustering and supervised learning is used. However, a potential issue arises when the implicitly clustered word senses do not perfectly align with the labeled word senses [10,11]. The most effective approach among these is the supervised learning strategy [11,12].
Recently, large-scale pre-trained models, such as Bidirectional Encoder Representations from Transformers (BERT) [13] and Generative Pre-trained Transformer (GPT) [14], have demonstrated promising results in addressing various NLP challenges through transfer learning. However, in agglutinative languages like Korean, which exhibit complexities such as spacing and tokenization inconsistencies, accurate WSD remains challenging due to token mismatch problems [15].
Typically, to extend processing from character-level tokens to word-level or sentence-level representations, additional Convolutional Neural Network (CNN) layers are employed along with various pooling methods [16,17,18,19]. However, in languages with intricate morphological structures, addressing token mismatch problems using CNN layers is suboptimal. This is because a single word can be split into a varying number of tokens, making it difficult to merge these tokens effectively using a fixed window (kernel) size.
To address this issue, this paper proposes a novel method that retains the contextual information of subword tokens while ensuring that only the label information, crucial for fine-tuning, is propagated from the original word’s label to the separated subword tokens. This label information is subsequently processed using an attention mechanism. For Korean WSD, the proposed approach is validated by fine-tuning a BERT-based pre-trained model with the Sejong corpus [20], a dataset annotated by language experts, to address tokenization mismatch issues. The contributions of this research are as follows:
  • A novel BERT-based approach is proposed to address the token mismatch problem that can occur when using transfer learning-based models for natural language processing of languages with complex morphological structures, like Korean. This problem arises from the discrepancies between the tokens of pre-trained models and fine-tuning training data.
  • The proposed model introduces specific embedding methods and attention weight learning methods to enable the model to learn the relationships between words and subword tokens through an attention mechanism. Furthermore, a method for incorporating POS (Part-of-Speech) information into the input embedding is also proposed, which is useful for WSD.
  • To evaluate the effectiveness of the proposed method, the Sejong corpus, a representative dataset tagged by language experts, was used for implementation, and various experiments were conducted. The results showed an improvement of approximately 3–5% over existing subword embedding integration methods.
The structure of this paper is as follows. Section 2 introduces the characteristics of the Sejong corpus used for training data, discusses the token mismatch issues when using BERT pre-trained models, and presents the proposed embedding method and attention-based learning model designed to resolve these issues. Section 3 presents various experimental results and analyses to evaluate the effectiveness of the proposed method. Finally, Section 4 concludes the paper and summarizes the experimental findings.

2. Materials and Methods

2.1. Sejong Corpus Dataset

For the experiments on WSD algorithms, the Sejong Corpus Lexical Semantic Analysis Dataset 2020 (Version 2.0) [20] was used. This updated lexical semantic corpus was constructed in 2022 on the basis of the 2019 version and comprises 150,000 sentences, including approximately 100,000 written sentences and 50,000 spoken sentences. The corpus has been annotated by language experts for morphological analysis and lexical semantics, covering both content and predicate words. Table 1 presents the morphological tag list for the content and predicate words used in constructing the lexical semantic analysis corpus.
Table 2 provides an example of lexical semantic and morphological tagging for a sample sentence from the Sejong Corpus Lexical Semantic Dataset. All content and predicate words in the sentence are tagged at the root word level, with their respective meanings annotated as sense IDs, and their morphological information annotated as morphological tags.

2.2. Input Embeddings Using POS and Label Propagation

To fine-tune the pre-trained models using additional training data, the data must first be tokenized using the pre-trained model’s tokenizer to generate consistent subword token embeddings. Alternatively, for tasks like Named Entity Recognition (NER), additional data can be pre-tokenized and annotated with BIO (Beginning, Inside, Outside) tagging frameworks [21] before fine-tuning. However, unlike NER tasks, which focus on a limited set of entities [22,23], WSD tasks require BIO tags for all input words. Consequently, the number of BIO tags may exceed the number of words, making this approach less effective for fine-tuning due to the excessive number of class labels to classify.
Table 3 shows the subword tokens of the example sentence from Table 2, tokenized using the tokenizers of the BERT multilingual base model (bert-base-multilingual-cased) [24] and the Korean Comments BERT models [25] (kcbert-base [26], kcbert-large [27]). The tokenization results indicate that kcbert-base and kcbert-large models produce identical subword tokens due to their shared vocabulary. Most pre-trained models employ WordPiece tokenizers [28] to tokenize at the character level, enabling the use of maximal contextual information. As a result, words in the Sejong corpus, which are tokenized at the word level, are split into subword tokens at the character level.
As shown in Table 3, subword tokens often bear meanings distinct from the original word before tokenization. While pre-trained models like BERT can infer the approximate meanings of subword tokens based on context, some cases in Korean make this inference impossible. Additionally, subword tokens of foreign words written in Korean phonetic transcriptions may deviate significantly from the original word’s semantics. For example, the word ‘캔버스’ (Canvas) is tokenized into the subwords ‘캔’ (Can) and ‘버스’ (Bus). In Korean, these subword tokens are independently used as loanwords with distinct meanings unrelated to the original word ‘Canvas’.
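To illustrate this mismatch concretely, the following minimal sketch (assuming the Hugging Face transformers library is available; the exact output depends on each model's vocabulary) tokenizes the example words with the multilingual BERT tokenizer cited above.

```python
from transformers import AutoTokenizer

# Multilingual BERT tokenizer referenced in Table 3.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# '캔버스' (Canvas) is split into subwords whose independent loanword senses
# ('캔' = Can, '버스' = Bus) are unrelated to the original word's meaning.
print(tokenizer.tokenize("캔버스"))    # expected to resemble ['캔', '##버스']
print(tokenizer.tokenize("청계천의"))  # expected to resemble ['청', '##계', '##천', '##의']
```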
When fine-tuning pre-trained models using the Sejong corpus, sense labels are annotated on the original word tokens rather than the subword tokens generated by the tokenizer. This discrepancy creates token mismatch issues, making it infeasible to directly utilize labeled semantic annotations for fine-tuning. To address this, the proposed method propagates the original word’s sense labels to its corresponding subword tokens while preserving their contextual information. These propagated labels are then processed using an attention mechanism. Additionally, morphological information of words is also propagated to the corresponding subword tokens and integrated into the input embeddings.
Figure 1 shows an example of input and output embedding vectors generated by the proposed method using the words, POS tags, and sense labels from Table 2, along with the subword tokens generated by the bert-base-multilingual-cased model in Table 3. Morphological and sense label information for each subword token is derived from the corresponding word token. For subword tokens unrelated to any word token, the encoding values are set to NULL (Figure 1). A detailed explanation of the model utilizing the proposed method is provided in Section 2.3.
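The propagation step sketched in Figure 1 can be illustrated as follows; this is a minimal sketch assuming a Hugging Face fast tokenizer, and the helper name propagate_labels and the None placeholder for NULL encodings are illustrative rather than part of the released implementation.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def propagate_labels(words, sense_ids, pos_tags):
    """Copy each word's sense ID and POS tag to all of its subword tokens.

    Subword positions not aligned to any word (e.g., [CLS], [SEP]) receive
    None, mirroring the NULL encoding shown in Figure 1.
    """
    enc = tokenizer(words, is_split_into_words=True)
    sub_senses, sub_pos = [], []
    for word_idx in enc.word_ids():          # source-word index, or None
        if word_idx is None:
            sub_senses.append(None)
            sub_pos.append(None)
        else:
            sub_senses.append(sense_ids[word_idx])
            sub_pos.append(pos_tags[word_idx])
    return enc, sub_senses, sub_pos

# Word-level annotations taken from the first rows of Table 2.
words     = ["서울", "청계천", "명물"]
sense_ids = [2, 1, 1]
pos_tags  = ["NNP", "NNP", "NNG"]
enc, senses, pos = propagate_labels(words, sense_ids, pos_tags)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(senses)
print(pos)
```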

2.3. Attention-Based Fine-Tuning Model Using Context-Preserving Subword Label Assignment

Figure 2 shows the overall architecture of the proposed model. First, as shown in Figure 1, the WordPiece token information derived from each word in the fine-tuning training data is used to propagate POS information and label information. Based on this, token embeddings, POS embeddings, segment embeddings, and position embeddings are generated. These embeddings are then added together to form the Input Embeddings.
The generated input embeddings are processed through 12 (BERT-base) or 24 (BERT-large) BERT layers based on Multi-Head Attention for training. During evaluation, a Linear layer and a Softmax layer are used to compute the predicted probability values for the sense ID labels of each input token. The remainder of this section formulates the methods for adjusting the attention weights during model training.
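The composition of these input embeddings can be sketched as below; this is a minimal PyTorch illustration under assumed sizes (vocabulary, POS tag set, hidden dimension), not the released model code.

```python
import torch
import torch.nn as nn

class InputEmbeddings(nn.Module):
    """Token + POS + segment + position embeddings summed, as in Figure 2.

    The sizes below (vocabulary, POS tag set, hidden dimension, maximum
    length) are illustrative assumptions.
    """
    def __init__(self, vocab_size=119547, num_pos_tags=50, hidden=768, max_len=256):
        super().__init__()
        self.token_emb    = nn.Embedding(vocab_size, hidden)
        self.pos_tag_emb  = nn.Embedding(num_pos_tags, hidden)   # propagated POS tags
        self.segment_emb  = nn.Embedding(2, hidden)
        self.position_emb = nn.Embedding(max_len, hidden)

    def forward(self, token_ids, pos_tag_ids, segment_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        return (self.token_emb(token_ids)
                + self.pos_tag_emb(pos_tag_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))
```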
In a given sentence $S = w_1, w_2, \ldots, w_n$, a word $w_i$ can be split into multiple subword tokens $t_{i1}, t_{i2}, \ldots, t_{im}$. For example, the word ‘청계천_Cheonggyecheon’ is split into three subword tokens: ‘청_Cheong’, ‘##계_gye’, and ‘##천_cheon’. Thus, for each word $w_i$, the subword tokens $t_{ij}$ are mapped as shown in Equation (1):

$$w_i \rightarrow t_{i1}, t_{i2}, \ldots, t_{im} \tag{1}$$
The relationship between each word $w_i$ and its subword token $t_{ij}$ is represented by the attention mechanism's weights in the attention layer, as shown in Equation (2):

$$\mathrm{AttentionWeight}(w_i, t_{ij}) = \mathrm{softmax}\!\left( \frac{Q_i K_{ij}^{T}}{\sqrt{d_k}} \right) \tag{2}$$

where $Q_i$ represents the query vector for word $w_i$, $K_{ij}$ represents the key vector for the subword token $t_{ij}$, and $d_k$ denotes the dimension size of the vectors. Additionally, the attention weights of the subword tokens $t_{ij}$ for each word $w_i$ are normalized using the softmax function.
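Equation (2) is the standard scaled dot-product attention weight; the following toy computation (with randomly initialized query and key vectors) shows how the weights over a word's subword tokens are normalized.

```python
import torch
import torch.nn.functional as F

d_k = 64
q = torch.randn(1, d_k)        # query vector Q_i for word w_i (illustrative)
k = torch.randn(3, d_k)        # key vectors K_ij for subword tokens t_i1..t_i3

scores  = q @ k.T / d_k ** 0.5          # Q_i K_ij^T / sqrt(d_k)
weights = F.softmax(scores, dim=-1)     # normalized over the subword tokens
print(weights, weights.sum())           # the weights sum to 1
```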
To map the sense labels of each word to the sense labels of its subword tokens $t_{ij}$, this paper proposes a method where the label of the word is propagated to each of its subword tokens. That is, all subword tokens $t_{ij}$ split from $w_i$ are assigned the same sense label. The mapping probability $p_{ij}^{l}$ for each label $l$ of a subword token is calculated as shown in Equation (3):

$$p_{ij}^{l} = \sum_{k} \mathrm{AttentionWeight}(w_i, t_{ij}) \tag{3}$$
The predicted class label for each subword token is determined by the class label with the highest mapping probability, using the argmax function, as shown in Equation (4):

$$\hat{l}_{ij} = \arg\max_{l}\, p_{ij}^{l} \tag{4}$$
Finally, the loss function used for training, based on cross-entropy, is calculated as shown in Equation (5):

$$\mathrm{Loss} = -\frac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{m_i} \sum_{l=1}^{C} y_{ij}^{l} \log \hat{p}_{ij}^{l} \tag{5}$$

where $N$ represents the total number of subword tokens, $m_i$ denotes the number of subword tokens split from each word $w_i$, $C$ is the total number of class labels, and $y_{ij}^{l}$ is 1 if the class label $l$ is the correct answer and 0 otherwise. Additionally, when morphological information is used to resolve word ambiguity, a POS embedding layer is added, as shown in Equation (6):
$$\mathrm{POSEmbedding}_{ij} = \mathrm{Embedding}(\mathrm{POS}_{ij}) \tag{6}$$

where $\mathrm{POS}_{ij}$ represents the morphological information for the subword token $t_{ij}$, which is propagated from the morphological information of the word $w_i$ in the same way that the word's sense label is propagated to its subword tokens.
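A minimal sketch of the training objective in Equations (4) and (5), using the propagated sense labels and an ignore index for NULL-encoded special tokens; the tensors below are illustrative placeholders, not the author's training code.

```python
import torch
import torch.nn as nn

IGNORE = -100   # NULL-encoded positions ([CLS], [SEP]) are excluded from the loss

# logits: (num_subword_tokens, num_sense_classes) from the classification head.
logits = torch.randn(6, 10)

# Propagated labels: every subword of a word shares that word's sense label.
labels = torch.tensor([IGNORE, 2, 1, 1, 1, IGNORE])

loss_fn = nn.CrossEntropyLoss(ignore_index=IGNORE)   # cross-entropy of Equation (5)
loss = loss_fn(logits, labels)

# Equation (4): predicted sense label per subword token via argmax.
pred = logits.argmax(dim=-1)
print(loss.item(), pred.tolist())
```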

3. Results

3.1. Experimental Environments and Metrics

To address the token unit mismatch issue between pre-trained models and fine-tuning training data, the method proposed in this paper was experimentally validated using several pre-trained models. These include the representative multilingual pre-trained model “bert-base-multilingual-cased” provided by Hugging Face, as well as the Korean-specific pre-trained models “kcbert-base” and “kcbert-large”. The “bert-base-multilingual-cased” model was pre-trained on Wikipedia text data from 104 languages, comprising 2.5 billion words, while the “kcbert-base” and “kcbert-large” models were pre-trained on a corpus of 110 million Korean comments collected from Naver News [29], using BERT’s base and large architectures, respectively.
To verify the effectiveness of the proposed method, it was compared with a commonly used subword token embedding integration method based on averaging. The datasets used for the experiments included the large dataset (SXLS2002104060.csv), containing 149,849 sentences, and the small dataset (NXLS2002104060_3.csv), containing 6782 sentences from the Sejong corpus. Analysis of the two datasets revealed that approximately 90% of the words had a single meaning, with the average number of meanings for polysemous words being around 2.8. The detailed statistics for both datasets are provided in Table 4.
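The averaging-based subword pooling baseline mentioned above can be sketched as follows; the function name and tensor shapes are illustrative, and the word-index mapping is the kind produced by a fast tokenizer's word_ids().

```python
import torch

def mean_pool_subwords(subword_embeddings, word_ids):
    """Average the encoder embeddings of subword tokens that share a source word.

    subword_embeddings: (num_subwords, hidden) tensor from the BERT encoder.
    word_ids: per-position source-word index (None for special tokens).
    """
    groups = {}
    for pos, widx in enumerate(word_ids):
        if widx is not None:
            groups.setdefault(widx, []).append(subword_embeddings[pos])
    return {widx: torch.stack(vecs).mean(dim=0) for widx, vecs in groups.items()}

# Illustrative call: six subword positions, hidden size 8.
emb = torch.randn(6, 8)
pooled = mean_pool_subwords(emb, [None, 0, 1, 1, 1, None])
print({k: v.shape for k, v in pooled.items()})
```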
For both datasets, the training data were split into an 80:20 ratio of training and test sets, with identical parameters applied to both datasets. The training parameters for each dataset are listed in Table 5. The maximum number of word tokens in the longest sentence of the training dataset is 99. When tokenized using the “kcbert” and “bert-base-multilingual-cased” models, the longest sentence resulted in 246 and 279 subword tokens, respectively. Therefore, the “Max Token Length” parameter was set to 256. The batch size was determined based on the memory capacity of the GPU used in the experiment—an NVIDIA GeForce RTX 4090 GPU with CUDA, manufactured in Taipei, Taiwan—as well as the dataset size. Accordingly, it was set to 16 for the small dataset and 32 for the large dataset.
For the learning rate, preliminary experiments were conducted by gradually decreasing it from 1 × 10−4 to 5 × 10−5, …, 1 × 10−5. The results showed that 2 × 10−5 achieved the best performance on the 20% test set for both the small and large datasets; when the learning rate was reduced further, overfitting occurred and performance declined. Hence, the learning rate was set to 2 × 10−5. The number of epochs was also determined based on preliminary training results for each dataset, ensuring an optimal configuration.
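For reference, the hyperparameters in Table 5 correspond to a fine-tuning configuration along the following lines; this is a hedged sketch using Hugging Face TrainingArguments (the output directory is an illustrative path), not the author's script.

```python
from transformers import TrainingArguments

# Settings from Table 5 for the small dataset; the large dataset uses a
# batch size of 32 and 3 epochs. The Max Token Length of 256 is applied
# when tokenizing the inputs rather than here.
args = TrainingArguments(
    output_dir="./korean-wsd-small",   # illustrative path
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=5,
)
```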
For evaluation, precision, recall, and the F1-score were used. Precision was calculated as the number of correctly predicted subword tokens divided by the total number of predicted subword tokens. Recall was defined as the ratio of correctly predicted subword tokens to the total number of true subword tokens in the test dataset. The F1-score was computed as the harmonic mean of precision and recall, as shown in Equation (7):
$$\mathrm{Precision} = \frac{\#\ \text{of correctly predicted subword tokens}}{\#\ \text{of predicted subword tokens}}$$

$$\mathrm{Recall} = \frac{\#\ \text{of correctly predicted subword tokens}}{\#\ \text{of true subword tokens}}$$

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$
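A minimal sketch of these token-level metrics (the prediction and gold lists are illustrative placeholders):

```python
def precision_recall_f1(pred_labels, true_labels):
    """Token-level precision, recall, and F1-score as defined in Equation (7)."""
    correct = sum(p == t for p, t in zip(pred_labels, true_labels))
    precision = correct / len(pred_labels)
    recall = correct / len(true_labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1([1, 2, 1, 3], [1, 2, 2, 3]))  # (0.75, 0.75, 0.75)
```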

3.2. Experimental Results and Discussion

Table 6 and Table 7 present the results of various experiments on the small and large datasets from the Sejong corpus. The model labeled “Word Only” uses each BERT pre-trained model with only words as input data, while the model labeled “w/ POS” includes morphological information along with the words. Generally, the proposed model outperformed the models using subword pooling embeddings by about 3–5% in terms of the F1-score, regardless of the BERT pre-trained model or input data. Models utilizing morphological information showed a performance improvement of about 0.2–0.4%. Models fine-tuned with the large dataset typically showed a 0.2–0.5% increase in performance. However, considering that the monosemous word ratio in the large dataset is 0.3% higher than that in the small dataset, the performance difference is minimal, and the proposed model performed well even with the small dataset.
Regarding the precision, models using morphological information showed a precision range of 92.7% to 93.8% when using subword pooling embeddings. However, this is still relatively low, considering that around 90% of the training data consists of monosemous words. In contrast, the proposed model achieved a precision range of 97.6% to 98.4%, which is significantly higher, even when accounting for the monosemous word ratio. The lower performance of subword pooling embedding models can be attributed to their simple combination of subword embeddings to form word embeddings. This results in a dilution of the context of the most important subword tokens within a word, as the independent context information of subword tokens becomes mixed.
For models using subword pooling embeddings, the bert-base-multilingual-cased pre-trained model generally performed better than the kcbert-base and kcbert-large models. This is because the bert-base-multilingual-cased model is a multilingual pre-trained model with a vocabulary size of 119,547, while the kcbert-base and kcbert-large models, which are Korean-specific, each have a vocabulary size of 30,000. In other words, the embedding vector space of the bert-base-multilingual-cased model is about three or four times larger than that of the kcbert models, reducing overlap and interference in the embedding space when subword tokens are combined.
On the other hand, the proposed method performed better with the Korean-specific models kcbert-base and kcbert-large. This is because the proposed method assigns the same morphological and semantic labels to each subword token without merging their context information, preserving the context of consecutive subword tokens that form the original word. Similarly to how BERT’s original segment embeddings represent token affiliations within a sentence, the subword tokens split from the same word retain the same morphological and label information, which helps maintain word-level contextual relationships.
In summary, the proposed method effectively preserves contextual information in the embedding vectors despite the token unit mismatch problem. Since the training data consists exclusively of Korean sentences, the method performed better with the Korean-specific models, which construct the embedding space solely with Korean subword tokens.
Additionally, to verify whether the precision of the proposed model is statistically significantly higher than that of the subword pooling model, a one-tailed paired t-test was conducted using the precisions of all pre-trained models on both the small and large datasets, as presented in Table 6 and Table 7. To assess the impact of incorporating POS, a one-tailed paired t-test was also performed to compare the precisions of the proposed model with and without POS. The confidence intervals and p-values for each comparison are summarized in Table 8.
The results of the paired t-test indicate that for the proposed model with POS, the precision increase ranges from a minimum of 3% to a maximum of 7.1% within the 95% confidence interval, with p-values significantly smaller than 0.01. Therefore, even at a 99% confidence level, the improvement in precision achieved by the proposed model is statistically significant. Furthermore, the results demonstrate that incorporating POS into the input encoding of the proposed model leads to a statistically significant improvement in precision compared to the model without POS at the 95% confidence level.
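The one-tailed paired t-test of Table 8 can be reproduced in outline with SciPy; the precision values below are placeholders rather than the exact figures from Tables 6 and 7.

```python
from scipy import stats

# Paired precisions (subword pooling vs. proposed) across pre-trained models
# and datasets; the values here are illustrative placeholders.
pooling  = [0.935, 0.927, 0.927, 0.938, 0.927, 0.930]
proposed = [0.976, 0.980, 0.981, 0.979, 0.983, 0.984]

# One-tailed alternative: the pooling precision is lower than the proposed one.
t_stat, p_value = stats.ttest_rel(pooling, proposed, alternative="less")
print(t_stat, p_value)
```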
For a more detailed analysis, a Receiver Operating Characteristic (ROC) analysis was conducted on all sense ID classes using the proposed model, with “kcbert-large” as the pre-trained model, which achieved the best performance. The analysis was performed on the large dataset from the Sejong corpus. Since WSD is a multi-class classification problem, each sense ID class was analyzed by treating the corresponding sense ID class as the true value and the remaining sense ID classes as the false value. As a result, the Area Under the Curve (AUC) values for the model using only word tokens and the model using both word and POS tokens for all classes were 0.992 and 0.994, respectively.
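The one-vs-rest ROC analysis can be sketched with scikit-learn's roc_auc_score; the sense ID labels and class probabilities below are random placeholders used only to show the call.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Illustrative placeholders: 1000 subword tokens, 5 sense ID classes.
y_true  = rng.integers(0, 5, size=1000)
y_score = rng.random((1000, 5))
y_score /= y_score.sum(axis=1, keepdims=True)   # rows behave like probabilities

# Macro-averaged AUC, treating each sense ID class as one-vs-rest.
print(roc_auc_score(y_true, y_score, multi_class="ovr", average="macro"))
```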
Additionally, an individual ROC analysis was performed for the top 20 most frequent sense ID classes, with the results presented in Figure 3 and Figure 4. The figures show the ROC curves, AUC values, and occurrence counts for the models using only word tokens and those using both word and POS tokens, respectively. The analysis showed that both models achieved perfect classification for sense ID classes outside the top 10 most frequent ones.
However, for the more frequently occurring sense ID classes, the classification error rate slightly increased. Notably, the classification performance of the most frequent sense ID class (sense ID = 1) significantly improved from 90% to 96% when the POS information was additionally used. In cases where a sense ID appears frequently, contextual lexical information often overlaps, making classification more challenging. However, by incorporating the POS information, the model learns to classify a word’s contextual information based on its POS, which appears to have contributed to the improved classification performance.
Adding the POS information does not significantly improve overall precision, with only about a 0.3–0.5% increase in performance. However, as shown in the above ROC analysis, the classification performance for the high-frequency sense ID classes improved significantly by approximately 6%. Additionally, we analyzed the extra computational cost incurred when incorporating the POS information. As presented in Table 9, experimental results showed that for all pre-trained models, and across both small and large datasets, computation time increased by only about 4%. Therefore, adding POS embeddings does not appear to pose a significant issue in terms of computational cost.

4. Conclusions

In this study, a method was proposed to effectively address the tokenization mismatch problem that arises during the fine-tuning of transfer learning using BERT and the Sejong corpus for Korean WSD. To resolve the subword token mismatch caused by tokenization, the proposed method preserves the contextual information of separated subword tokens while propagating the morphological and label information of the original word to the subword tokens. This information is then integrated through the attention mechanism.
For the experiments, the multilingual pre-trained model ‘bert-base-multilingual-cased’ and the Korean-specific models ‘kcbert-base’ and ‘kcbert-large’ were used. To evaluate the effectiveness of the proposed method, its performance was compared with the most widely used CNN-based subword pooling method. The results showed that the proposed method outperformed subword pooling by approximately 3–5% in precision, recall, and F1-score, achieving over 98% precision. Notably, when additional morphological information was incorporated, the precision increased to 98.4%, demonstrating a highly competitive performance. Additionally, a method was proposed to integrate the POS information of words into the subword-level input embeddings. The classification performance, measured by the AUC for the most frequent words’ first sense IDs, improved from 90% when using only word tokens to 96% when POS information was also utilized.
In conclusion, the method proposed in this paper overcomes the limitations of the existing subword pooling method by fully preserving the contextual information of subword tokens and enhancing the affiliation relationships at the word level through a novel approach. This is particularly beneficial for languages like Korean, which have complex morphological structures, and can also be applied to various downstream tasks based on transfer learning that leverage pre-trained contextual and lexical information from large corpora. While this research primarily focused on addressing the token mismatch problem in WSD as a downstream task itself, future studies will explore integrating the proposed model—which learns contextual information based on word senses rather than words—into practical applications such as information extraction and question answering.

Funding

This research was funded by Research Funds of Mokpo National University in 2022.

Data Availability Statement

The datasets used in this study can be downloaded from https://kcorpus.korean.go.kr/ (accessed on 24 January 2025). Since they are managed by the National Institute of Korean Language, an institution under the Ministry of Culture, Sports and Tourism of Korea, it is necessary to request access for research purposes and obtain permission before downloading.

Acknowledgments

This research was supported by the Research Funds of Mokpo National University in 2022.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lee, M.W. Semantic Relations from the Contextual Perspective. Korean Semant. 2019, 66, 101–120. [Google Scholar] [CrossRef]
  2. Kwon, S.; Oh, D.; Ko, Y. Word sense disambiguation based on context selection using knowledge-based word similarity. Inf. Process. Manag. 2021, 58, 102551. [Google Scholar] [CrossRef]
  3. Zhong, L.Y.; Wang, T.H. Towards word sense disambiguation using multiple kernel support vector machine. Int. J. Innov. Comput. Inf. Control. 2020, 16, 555–570. [Google Scholar]
  4. Abraham, A.; Gupta, B.K.; Maurya, A.S.; Verma, S.B.; Husain, M.; Ali, A.; Alshmrany, S.; Gupta, S. Naïve Bayes Approach for Word Sense Disambiguation System with a Focus on Parts-of-Speech Ambiguity Resolution. IEEE Access 2024, 12, 126668–126678. [Google Scholar] [CrossRef]
  5. Mir, T.A.; Lawaye, A.A. Naïve Bayes classifier for Kashmiri word sense disambiguation. Sādhanā 2024, 49, 226. [Google Scholar] [CrossRef]
  6. Kim, M.; Kwon, H.C. Word sense disambiguation using prior probability estimation based on the Korean WordNet. Electronics 2021, 10, 2938. [Google Scholar] [CrossRef]
  7. AlMousa, M.; Benlamri, R.; Khoury, R. A novel word sense disambiguation approach using WordNet knowledge graph. Comput. Speech Lang. 2022, 74, 101337. [Google Scholar] [CrossRef]
  8. Park, J.Y.; Shin, H.J.; Lee, J.S. Word sense disambiguation using clustered sense labels. Appl. Sci. 2022, 12, 1857. [Google Scholar] [CrossRef]
  9. Rahman, N.; Borah, B. An unsupervised method for word sense disambiguation. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 6643–6651. [Google Scholar] [CrossRef]
  10. Waheeb, S.A.; Khan, N.A.; Shang, X. An efficient sentiment analysis based deep learning classification model to evaluate treatment quality. Malays. J. Comput. Sci. 2022, 35, 1–20. [Google Scholar] [CrossRef]
  11. Haldorai, A.; Arulmurugan, R. Supervised, Unsupervised and Semi-Supervised Word Sense Disambiguation Approaches. Adv. Intell. Syst. Technol. 2022, 66–75. [Google Scholar] [CrossRef] [PubMed]
  12. Abdelaali, B.; Tlili-Guiassa, Y. Swarm optimization for Arabic word sense disambiguation based on English pre-trained word embeddings. In Proceedings of the International Symposium on Informatics and Its Applications (ISIA) 2022, M’sila, Algeria, 29–30 November 2022; pp. 1–6. [Google Scholar]
  13. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  14. Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
  15. Arnett, C.; Bergen, B.K. Why do language models perform worse for morphologically complex languages? arXiv 2024, arXiv:2411.14198. [Google Scholar]
  16. Ács, J.; Kádár, Á.; Kornai, A. Subword pooling makes a difference. arXiv 2021, arXiv:2102.10864. [Google Scholar]
  17. Zhao, S.; You, F.; Chang, W.; Zhang, T.; Hu, M. Augment BERT with average pooling layer for Chinese summary generation. J. Intell. Fuzzy Syst. 2022, 42, 1859–1868. [Google Scholar] [CrossRef]
  18. Jia, K. Sentiment classification of microblog: A framework based on BERT and CNN with attention mechanism. Comput. Electr. Eng. 2022, 101, 108032. [Google Scholar] [CrossRef]
  19. Agarwal, S.; Fincke, S.; Jenkins, C.; Miller, S.; Boschee, E. Impact of subword pooling strategy on cross-lingual event detection. arXiv 2023, arXiv:2302.11365. [Google Scholar]
  20. Sejong Corpus. Available online: https://kcorpus.korean.go.kr/ (accessed on 24 January 2025).
  21. Ramshaw, L.A.; Marcus, M.P. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora; Springer: Dordrecht, The Netherlands, 1999; pp. 157–176. [Google Scholar]
  22. Jeong, H. A Transfer Learning-Based Pairwise Information Extraction Framework Using BERT and Korean-Language Modification Relationships. Symmetry 2024, 16, 136. [Google Scholar] [CrossRef]
  23. Jehangir, B.; Radhakrishnan, S.; Agarwal, R. A survey on Named Entity Recognition—Datasets, tools, and methodologies. Nat. Lang. Process. J. 2023, 3, 100017. [Google Scholar] [CrossRef]
  24. BERT Multilingual Base Model. Available online: https://huggingface.co/bert-base-multilingual-cased (accessed on 24 January 2025).
  25. Lee, J. Kcbert: Korean Comments Bert. In Proceedings of the Annual Conference on Human and Language Technology, Lisbon, Portugal, 3–5 November 2020; pp. 437–440. [Google Scholar]
  26. Korean Comments BERT Base Model. Available online: https://huggingface.co/beomi/kcbert-base (accessed on 24 January 2025).
  27. Korean Comments BERT Large Model. Available online: https://huggingface.co/beomi/kcbert-large (accessed on 24 January 2025).
  28. Sennrich, R. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909. [Google Scholar]
  29. Naver News. Available online: https://news.naver.com/ (accessed on 24 January 2025).
Figure 1. BERT-based input and output embedding representations for the proposed method.
Figure 2. Overall architecture of the BERT-based proposed model.
Figure 3. ROC analysis results for the top 20 most frequent sense ID classes using the proposed model with word tokens only.
Figure 4. ROC analysis results for the top 20 most frequent sense ID classes using the proposed model with both word and POS tokens.
Table 1. List of morphological tags.

| Tag | Morpheme |
|---|---|
| NNG | Common Noun |
| NNP | Proper Noun |
| NNB | Dependent Noun |
| NP | Pronoun |
| NR | Numeral |
| VV | Verb |
| VA | Adjective |
| VX | Auxiliary Verb |
| VCP | Positive Copula |
| VCN | Negative Copula |
| XR | Root |
Table 2. Example of words, word meanings, and morpheme tagging in the Sejong corpus.

| sent_form | Word | sense_id | POS |
|---|---|---|---|
| 서울 청계천의 명물 ‘디지털캔버스’가 새 옷으로 갈아입는다. (The iconic “Digital Canvas” of Seoul’s Cheonggyecheon is getting a makeover.) | 서울 (Seoul) | 2 | NNP 1 |
| | 청계천 (Cheonggyecheon) | 1 | NNP 1 |
| | 명물 (Icon) | 1 | NNG 2 |
| | 디지털 (Digital) | 1 | NNG 2 |
| | 캔버스 (Canvas) | 1 | NNG 2 |
| | 옷 (Outfit) | 1 | NNG 2 |
| | 갈아입 (Changing into) | 1 | VV 3 |

1 NNP denotes Proper Noun; 2 NNG denotes General Noun; 3 VV denotes Verb.
Table 3. Example of tokenization using the tokenizer of BERT-based pre-trained models.

| Tokenizer | Tokens |
|---|---|
| bert-base-multilingual-cased | [‘서울_Seoul’, ‘청_blue’, ‘##계_lineage’, ‘##천_sky’, ‘##의_of’, ‘명_name’, ‘##물_water’, “‘”, ‘디지털_Digital’, ‘##캔_Can’, ‘##버스_bus’, “‘”, ‘가_is’, ‘새_new’, ‘옷_clothes’, ‘##으로_into’, ‘갈_go’, ‘##아_to’, ‘##입_enter’, ‘##는다_ing’, ‘.’] |
| kcbert-base / kcbert-large | [‘서울_Seoul’, ‘청_blue’, ‘##계_lineage’, ‘##천_sky’, ‘##의_of’, ‘명_name’, ‘##물_water’, “‘”, ‘디지_Digi’, ‘##털_fur’, ‘##캔_Can’, ‘##버스_bus’, “‘”, ‘가_is’, ‘새_new’, ‘옷_clothes’, ‘##으로_into’, ‘갈아_change’, ‘##입_enter’, ‘##는다_ing’, ‘.’] |

The symbol “##” is used in BERT’s WordPiece tokenizer to indicate a sequence of continuous WordPiece tokens when a word is split into multiple WordPiece tokens.
Table 4. Dataset statistics.

| Statistics | Small Dataset | Large Dataset |
|---|---|---|
| Number of Sentences | 6782 | 149,849 |
| Number of Unique Words | 14,551 | 76,970 |
| Number of Senses per Word (Min, Max) | (1, 26) | (1, 41) |
| Monosemous Word Ratio | 90.4% | 90.7% |
| Average Number of Senses for Polysemous Words | 2.74 | 2.79 |
Table 5. Training parameters.

| Parameter | Small Dataset | Large Dataset |
|---|---|---|
| Max Token Length | 256 | 256 |
| Batch Size | 16 | 32 |
| Learning Rate | 2 × 10−5 | 2 × 10−5 |
| Number of Epochs | 5 | 3 |
Table 6. Experimental results on the small dataset of the Sejong corpus.

| Pre-Trained Model | Input | Subword Pooling Model (Precision / Recall / F1-Score) | Proposed Model (Precision / Recall / F1-Score) |
|---|---|---|---|
| bert-base-multilingual-cased | Word Only | 93.0 / 96.1 / 94.5 | 96.8 / 98.4 / 97.6 |
| bert-base-multilingual-cased | w/ POS | 93.5 / 96.1 / 94.8 | 97.6 / 98.4 / 98.0 |
| kcbert-base | Word Only | 92.4 / 95.5 / 93.9 | 97.5 / 98.8 / 98.1 |
| kcbert-base | w/ POS | 92.7 / 95.5 / 94.1 | 98.0 / 98.8 / 98.4 |
| kcbert-large | Word Only | 92.3 / 95.5 / 93.9 | 97.5 / 98.8 / 98.1 |
| kcbert-large | w/ POS | 92.7 / 95.5 / 94.1 | 98.1 / 98.8 / 98.4 |
Table 7. Experimental results on the large dataset of the Sejong corpus.

| Pre-Trained Model | Input | Subword Pooling Model (Precision / Recall / F1-Score) | Proposed Model (Precision / Recall / F1-Score) |
|---|---|---|---|
| bert-base-multilingual-cased | Word Only | 93.6 / 95.9 / 94.7 | 97.4 / 98.4 / 97.9 |
| bert-base-multilingual-cased | w/ POS | 93.8 / 96.1 / 94.9 | 97.9 / 98.4 / 98.1 |
| kcbert-base | Word Only | 92.6 / 95.5 / 94.0 | 98.0 / 98.7 / 98.3 |
| kcbert-base | w/ POS | 92.7 / 95.5 / 94.1 | 98.3 / 98.7 / 98.5 |
| kcbert-large | Word Only | 92.9 / 95.4 / 94.1 | 98.1 / 98.7 / 98.4 |
| kcbert-large | w/ POS | 93.0 / 95.5 / 94.2 | 98.4 / 98.7 / 98.5 |
Table 8. Results of the one-tailed paired t-test for the proposed model and the subword pooling model with different encodings.

| Comparison Models | t-Statistic | Confidence Interval (95%) | p-Value |
|---|---|---|---|
| Subword Pooling Model w/ Only Word vs. Proposed Model w/ Only Word | −9.537 | [−0.070, −0.026] | 0.005 |
| Subword Pooling Model w/ Word and POS vs. Proposed Model w/ Word and POS | −10.704 | [−0.071, −0.030] | 0.004 |
| Proposed Model w/ Only Word vs. Proposed Model w/ Word and POS | −5.5 | [−0.007, −0.001] | 0.016 |
Table 9. Comparison of execution time for each pre-trained model and dataset between the word-only model and the word and POS model.

| Pre-Trained Model | Dataset | Model | Avg. Training Time per Epoch (Seconds) | Increase Rate (%) |
|---|---|---|---|---|
| bert-base-multilingual-cased | Small | w/ Only Word | 533 | 4.13 |
| bert-base-multilingual-cased | Small | w/ Word and POS | 555 | |
| bert-base-multilingual-cased | Large | w/ Only Word | 9216 | 4.24 |
| bert-base-multilingual-cased | Large | w/ Word and POS | 9607 | |
| kcbert-base | Small | w/ Only Word | 497 | 4.02 |
| kcbert-base | Small | w/ Word and POS | 517 | |
| kcbert-base | Large | w/ Only Word | 8571 | 4.15 |
| kcbert-base | Large | w/ Word and POS | 8927 | |
| kcbert-large | Small | w/ Only Word | 1352 | 3.92 |
| kcbert-large | Small | w/ Word and POS | 1405 | |
| kcbert-large | Large | w/ Only Word | 23,833 | 4.07 |
| kcbert-large | Large | w/ Word and POS | 24,855 | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
