A Context-Preserving Tokenization Mismatch Resolution Method for Korean Word Sense Disambiguation Based on the Sejong Corpus and BERT
Figure captions:
- BERT-based input and output embedding representations for the proposed method.
- Overall architecture of the BERT-based proposed model.
- ROC analysis results for the top 20 most frequent sense ID classes using the proposed model with word tokens only.
- ROC analysis results for the top 20 most frequent sense ID classes using the proposed model with both word and POS tokens.
Abstract
1. Introduction
- A novel BERT-based approach is proposed to address the token mismatch problem that arises when transfer-learning models are applied to the natural language processing of morphologically complex languages such as Korean. The problem stems from discrepancies between the tokens of the pre-trained model and those of the fine-tuning training data.
- The proposed model introduces embedding and attention-weight learning methods that allow the relationships between words and their subword tokens to be learned through the attention mechanism. A method for incorporating POS (part-of-speech) information, which is useful for WSD, into the input embedding is also proposed (a minimal sketch follows this list).
- To evaluate the effectiveness of the proposed method, experiments were conducted on the Sejong corpus, a representative Korean dataset annotated by language experts. The results show an improvement of approximately 3–5% over existing subword embedding integration methods.
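To make the POS-augmented input embedding concrete, the following is a minimal sketch in PyTorch. It simply adds a learned POS-tag embedding to the token and position embeddings before the Transformer encoder; all sizes and names are hypothetical and this is not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class PosAugmentedEmbedding(nn.Module):
    """Sketch: add a learned POS-tag embedding to the usual token and
    position embeddings before the BERT encoder. All sizes are hypothetical."""

    def __init__(self, vocab_size=30000, num_pos_tags=45, hidden_size=768, max_len=256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.pos_tag_emb = nn.Embedding(num_pos_tags, hidden_size)  # NNG, NNP, VV, ...
        self.position_emb = nn.Embedding(max_len, hidden_size)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, token_ids, pos_tag_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.token_emb(token_ids)
             + self.pos_tag_emb(pos_tag_ids)
             + self.position_emb(positions))
        return self.norm(x)

# Toy usage: a batch of one sentence with five subword tokens.
emb = PosAugmentedEmbedding()
token_ids = torch.randint(0, 30000, (1, 5))
pos_tag_ids = torch.randint(0, 45, (1, 5))
print(emb(token_ids, pos_tag_ids).shape)  # torch.Size([1, 5, 768])
```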
2. Materials and Methods
2.1. Sejong Corpus Dataset
2.2. Input Embeddings Using POS and Label Propagation
2.3. Attention-Based Fine-Tuning Model Using Context-Preserving Subword Label Assignment
3. Results
3.1. Experimental Environments and Metrics
3.2. Experimental Results and Discussion
4. Conclusions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Lee, M.W. Semantic Relations from the Contextual Perspective. Korean Semant. 2019, 66, 101–120.
- Kwon, S.; Oh, D.; Ko, Y. Word sense disambiguation based on context selection using knowledge-based word similarity. Inf. Process. Manag. 2021, 58, 102551.
- Zhong, L.Y.; Wang, T.H. Towards word sense disambiguation using multiple kernel support vector machine. Int. J. Innov. Comput. Inf. Control 2020, 16, 555–570.
- Abraham, A.; Gupta, B.K.; Maurya, A.S.; Verma, S.B.; Husain, M.; Ali, A.; Alshmrany, S.; Gupta, S. Naïve Bayes Approach for Word Sense Disambiguation System with a Focus on Parts-of-Speech Ambiguity Resolution. IEEE Access 2024, 12, 126668–126678.
- Mir, T.A.; Lawaye, A.A. Naïve Bayes classifier for Kashmiri word sense disambiguation. Sādhanā 2024, 49, 226.
- Kim, M.; Kwon, H.C. Word sense disambiguation using prior probability estimation based on the Korean WordNet. Electronics 2021, 10, 2938.
- AlMousa, M.; Benlamri, R.; Khoury, R. A novel word sense disambiguation approach using WordNet knowledge graph. Comput. Speech Lang. 2022, 74, 101337.
- Park, J.Y.; Shin, H.J.; Lee, J.S. Word sense disambiguation using clustered sense labels. Appl. Sci. 2022, 12, 1857.
- Rahman, N.; Borah, B. An unsupervised method for word sense disambiguation. J. King Saud Univ. Comput. Inf. Sci. 2022, 34, 6643–6651.
- Waheeb, S.A.; Khan, N.A.; Shang, X. An efficient sentiment analysis based deep learning classification model to evaluate treatment quality. Malays. J. Comput. Sci. 2022, 35, 1–20.
- Haldorai, A.; Arulmurugan, R. Supervised, Unsupervised and Semi-Supervised Word Sense Disambiguation Approaches. Adv. Intell. Syst. Technol. 2022, 66–75.
- Abdelaali, B.; Tlili-Guiassa, Y. Swarm optimization for Arabic word sense disambiguation based on English pre-trained word embeddings. In Proceedings of the International Symposium on Informatics and Its Applications (ISIA) 2022, M’sila, Algeria, 29–30 November 2022; pp. 1–6.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 technical report. arXiv 2023, arXiv:2303.08774.
- Arnett, C.; Bergen, B.K. Why do language models perform worse for morphologically complex languages? arXiv 2024, arXiv:2411.14198.
- Ács, J.; Kádár, Á.; Kornai, A. Subword pooling makes a difference. arXiv 2021, arXiv:2102.10864.
- Zhao, S.; You, F.; Chang, W.; Zhang, T.; Hu, M. Augment BERT with average pooling layer for Chinese summary generation. J. Intell. Fuzzy Syst. 2022, 42, 1859–1868.
- Jia, K. Sentiment classification of microblog: A framework based on BERT and CNN with attention mechanism. Comput. Electr. Eng. 2022, 101, 108032.
- Agarwal, S.; Fincke, S.; Jenkins, C.; Miller, S.; Boschee, E. Impact of subword pooling strategy on cross-lingual event detection. arXiv 2023, arXiv:2302.11365.
- Sejong Corpus. Available online: https://kcorpus.korean.go.kr/ (accessed on 24 January 2025).
- Ramshaw, L.A.; Marcus, M.P. Text chunking using transformation-based learning. In Natural Language Processing Using Very Large Corpora; Springer: Dordrecht, The Netherlands, 1999; pp. 157–176.
- Jeong, H. A Transfer Learning-Based Pairwise Information Extraction Framework Using BERT and Korean-Language Modification Relationships. Symmetry 2024, 16, 136.
- Jehangir, B.; Radhakrishnan, S.; Agarwal, R. A survey on Named Entity Recognition—Datasets, tools, and methodologies. Nat. Lang. Process. J. 2023, 3, 100017.
- BERT Multilingual Base Model. Available online: https://huggingface.co/bert-base-multilingual-cased (accessed on 24 January 2025).
- Lee, J. KcBERT: Korean Comments BERT. In Proceedings of the Annual Conference on Human and Language Technology, Lisbon, Portugal, 3–5 November 2020; pp. 437–440.
- Korean Comments BERT Base Model. Available online: https://huggingface.co/beomi/kcbert-base (accessed on 24 January 2025).
- Korean Comments BERT Large Model. Available online: https://huggingface.co/beomi/kcbert-large (accessed on 24 January 2025).
- Sennrich, R. Neural machine translation of rare words with subword units. arXiv 2015, arXiv:1508.07909.
- Naver News. Available online: https://news.naver.com/ (accessed on 24 January 2025).
Tag | Morpheme |
---|---|
NNG | Common Noun |
NNP | Proper Noun |
NNB | Dependent Noun |
NP | Pronoun |
NR | Numeral |
VV | Verb |
VA | Adjective |
VX | Auxiliary Verb |
VCP | Positive Copula |
VCN | Negative Copula |
XR | Root |
sent_form | Word | sense_id | POS |
---|---|---|---|
서울 청계천의 명물 ‘디지털캔버스’가 새 옷으로 갈아입는다. (The iconic “Digital Canvas” of Seoul’s Cheonggyecheon is getting a makeover.) | 서울 (Seoul) | 2 | NNP 1 |
| 청계천 (Cheonggyecheon) | 1 | NNP 1 |
| 명물 (Icon) | 1 | NNG 2 |
| 디지털 (Digital) | 1 | NNG 2 |
| 캔버스 (Canvas) | 1 | NNG 2 |
| 옷 (Outfit) | 1 | NNG 2 |
| 갈아입 (Changing into) | 1 | VV 3 |
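As a rough illustration of how word-level annotations such as those above could be carried down to subword tokens, the following sketch uses one plausible BIO-style scheme (in the spirit of the chunking labels cited in the References); it is not necessarily the paper's exact assignment rule.

```python
from transformers import AutoTokenizer

# Plausible BIO-style propagation of word-level sense IDs to subword tokens;
# the actual assignment scheme used in the paper may differ.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

def propagate_sense_labels(words, sense_ids):
    subword_tokens, subword_labels = [], []
    for word, sense in zip(words, sense_ids):
        for i, piece in enumerate(tokenizer.tokenize(word)):
            subword_tokens.append(piece)
            prefix = "B" if i == 0 else "I"   # first piece vs. continuation piece
            subword_labels.append(f"{prefix}-{sense}")
    return subword_tokens, subword_labels

words = ["서울", "청계천의", "명물"]
sense_ids = [2, 1, 1]
print(propagate_sense_labels(words, sense_ids))
```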
Tokenizer | Tokens |
---|---|
bert-base-multilingual-cased | [‘서울_Seoul’, ‘청_blue’, ‘##계_lineage’, ‘##천_sky’, ‘##의_of’, ‘명_name’, ‘##물_water’, “‘”, ‘디지털_Digital’, ‘##캔_Can’, ‘##버스_bus’, “‘”, ‘가_is’, ‘새_new’, ‘옷_clothes’, ‘##으로_into’, ‘갈_go’, ‘##아_to’, ‘##입_enter’, ‘##는다_ing’, ‘.’] |
kcbert-base/kcbert-large | [‘서울_Seoul’, ‘청_blue’, ‘##계_lineage’, ‘##천_sky’, ‘##의_of’, ‘명_name’, ‘##물_water’, “‘”, ‘디지_Digi’, ‘##털_fur’, ‘##캔_Can’, ‘##버스_bus’, “‘”, ‘가_is’, ‘새_new’, ‘옷_clothes’, ‘##으로_into’, ‘갈아_change’, ‘##입_enter’, ‘##는다_ing’, ‘.’] |
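The tokenizations above can be reproduced (modulo tokenizer versions) with the Hugging Face models cited in the References; a minimal sketch:

```python
from transformers import AutoTokenizer

# Compare how the multilingual and KcBERT tokenizers split the example sentence.
sentence = "서울 청계천의 명물 ‘디지털캔버스’가 새 옷으로 갈아입는다."
for name in ["bert-base-multilingual-cased", "beomi/kcbert-base"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    print(name, tokenizer.tokenize(sentence))
```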
Statistics | Small Dataset | Large Dataset |
---|---|---|
Number of Sentences | 6782 | 149,849 |
Number of Unique Words | 14,551 | 76,970 |
Number of Senses per Word (Min, Max) | (1, 26) | (1, 41) |
Monosemous Word Ratio | 90.4% | 90.7% |
Average Number of Senses for Polysemous Words | 2.74 | 2.79 |
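The statistics above can be derived from word-to-sense annotations roughly as follows; the annotations here are toy data standing in for the corpus, not the corpus itself.

```python
from collections import defaultdict

# Toy (word, sense_id) annotations standing in for the Sejong corpus.
annotations = [("서울", 2), ("서울", 1), ("명물", 1), ("옷", 1), ("옷", 1)]

senses_per_word = defaultdict(set)
for word, sense_id in annotations:
    senses_per_word[word].add(sense_id)

counts = [len(s) for s in senses_per_word.values()]
num_unique_words = len(senses_per_word)
monosemous_ratio = sum(c == 1 for c in counts) / num_unique_words
polysemous = [c for c in counts if c > 1]
avg_senses_polysemous = sum(polysemous) / len(polysemous) if polysemous else 0.0

print(num_unique_words, (min(counts), max(counts)),
      f"{monosemous_ratio:.1%}", avg_senses_polysemous)
```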
Parameter | Small Dataset | Large Dataset |
---|---|---|
Max Token Length | 256 | 256 |
Batch Size | 16 | 32 |
Learning Rate | 2 × 10⁻⁵ | 2 × 10⁻⁵ |
Number of Epochs | 5 | 3 |
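One way to express the small-dataset settings above is with Hugging Face TrainingArguments; this is only a sketch, since the paper's actual training loop and framework are not restated here, and the output path is hypothetical.

```python
from transformers import TrainingArguments

args_small = TrainingArguments(
    output_dir="wsd-small",          # hypothetical output directory
    per_device_train_batch_size=16,  # batch size (small dataset)
    learning_rate=2e-5,              # learning rate
    num_train_epochs=5,              # epochs (small dataset)
)

# The max token length (256) is applied when encoding, e.g.:
# tokenizer(sentence, max_length=256, truncation=True, padding="max_length")
```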
Model | Subword Pooling Model | | | Proposed Model | |
---|---|---|---|---|---|---
Metrics | Precision | Recall | F1-Score | Precision | Recall | F1-Score
bert-base-multilingual-cased | ||||||
Word Only | 93.0 | 96.1 | 94.5 | 96.8 | 98.4 | 97.6 |
w/ POS | 93.5 | 96.1 | 94.8 | 97.6 | 98.4 | 98.0 |
kcbert-base | ||||||
Word Only | 92.4 | 95.5 | 93.9 | 97.5 | 98.8 | 98.1 |
w/ POS | 92.7 | 95.5 | 94.1 | 98.0 | 98.8 | 98.4 |
kcbert-large | ||||||
Word Only | 92.3 | 95.5 | 93.9 | 97.5 | 98.8 | 98.1 |
w/ POS | 92.7 | 95.5 | 94.1 | 98.1 | 98.8 | 98.4 |
Model | Subword Pooling Model | | | Proposed Model | |
---|---|---|---|---|---|---
Metrics | Precision | Recall | F1-Score | Precision | Recall | F1-Score
bert-base-multilingual-cased | ||||||
Word Only | 93.6 | 95.9 | 94.7 | 97.4 | 98.4 | 97.9 |
w/ POS | 93.8 | 96.1 | 94.9 | 97.9 | 98.4 | 98.1 |
kcbert-base | ||||||
Word Only | 92.6 | 95.5 | 94.0 | 98.0 | 98.7 | 98.3 |
w/ POS | 92.7 | 95.5 | 94.1 | 98.3 | 98.7 | 98.5 |
kcbert-large | ||||||
Word Only | 92.9 | 95.4 | 94.1 | 98.1 | 98.7 | 98.4 |
w/ POS | 93.0 | 95.5 | 94.2 | 98.4 | 98.7 | 98.5 |
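The precision/recall/F1 figures above, and the per-class ROC analyses listed among the figure captions, can be computed along these lines. The labels and scores are toy values, and the averaging mode actually used in the paper is not restated here.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc

# Toy gold labels, predictions, and per-class scores for three sense-ID classes.
y_true = np.array([0, 1, 2, 1, 0, 2])
y_pred = np.array([0, 1, 2, 0, 0, 2])
scores = np.random.rand(6, 3)
scores /= scores.sum(axis=1, keepdims=True)      # normalize like softmax outputs

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")

# One-vs-rest ROC curve and AUC per class, as in the ROC figures.
for cls in range(3):
    fpr, tpr, _ = roc_curve((y_true == cls).astype(int), scores[:, cls])
    print(f"class {cls}: AUC = {auc(fpr, tpr):.3f}")
```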
Comparison Models | t-Statistic | Confidence Interval (95%) | p-Value |
---|---|---|---|
Subword Pooling Model w/ Only Word vs. Proposed Model w/ Only Word | −9.537 | (−0.070, −0.026) | 0.005 |
Subword Pooling Model w/ Word and POS vs. Proposed Model w/ Word and POS | −10.704 | (−0.071, −0.030) | 0.004 |
Proposed Model w/ Only Word vs. Proposed Model w/ Word and POS | −5.5 | (−0.007, −0.001) | 0.016 |
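A paired t-test of the kind reported above can be run as follows; the F1 values below are placeholders, not the paper's per-configuration scores.

```python
import numpy as np
from scipy import stats

# Placeholder per-model F1 scores for the baseline and the proposed model.
baseline_f1 = np.array([0.945, 0.939, 0.939])
proposed_f1 = np.array([0.976, 0.981, 0.981])

t_stat, p_value = stats.ttest_rel(baseline_f1, proposed_f1)

# 95% confidence interval for the mean paired difference.
diff = baseline_f1 - proposed_f1
ci_low, ci_high = stats.t.interval(0.95, len(diff) - 1,
                                   loc=diff.mean(), scale=stats.sem(diff))
print(f"t = {t_stat:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), p = {p_value:.3f}")
```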
Pre-Trained Model | Dataset | Model | Avg. Training Time Per Epoch (Seconds) | Increase Rate (%) |
---|---|---|---|---|
bert-base-multilingual-cased | Small | w/ Only Word | 533 | 4.13 |
| | w/ Word and POS | 555 | |
| Large | w/ Only Word | 9216 | 4.24 |
| | w/ Word and POS | 9607 | |
kcbert-base | Small | w/ Only Word | 497 | 4.02 |
| | w/ Word and POS | 517 | |
| Large | w/ Only Word | 8571 | 4.15 |
| | w/ Word and POS | 8927 | |
kcbert-large | Small | w/ Only Word | 1352 | 3.92 |
| | w/ Word and POS | 1405 | |
| Large | w/ Only Word | 23,833 | 4.07 |
| | w/ Word and POS | 24,855 | |
© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Citation: Jeong, H. A Context-Preserving Tokenization Mismatch Resolution Method for Korean Word Sense Disambiguation Based on the Sejong Corpus and BERT. Mathematics 2025, 13, 864. https://doi.org/10.3390/math13050864