Abstract
In recent years, Chinese word segmentation (CWS) has been studied extensively, and the research is mature. However, existing work focuses mainly on mainland Mandarin, and CWS for other countries and regions remains far from satisfactory. Although the Chinese used in the Malay Archipelago countries and the Mandarin of mainland China share a common origin, differences have emerged between them over their respective development. As a result, common CWS tools cannot segment Chinese texts from different countries accurately and effectively. This paper studies Chinese texts from five countries in the Malay Archipelago (Indonesia, Malaysia, Brunei, Singapore, and the Philippines), builds one CWS dataset for each country, and evaluates several advanced word segmentation tools and sequence labeling models on the constructed datasets. The experimental results show the effectiveness of the BERT (Bidirectional Encoder Representations from Transformers) model on the CWS task and provide a baseline for CWS in the five Malay Archipelago countries. Furthermore, we explore whether two training strategies can enhance CWS, and the experimental results show that neither strategy significantly improves CWS performance on Malay Archipelago texts. In addition, because CWS performance differs across countries, we analyze the objective and historical reasons in depth. These reasons mainly concern corpus size, Chinese language policy, and language education norms in the different Malay Archipelago countries.
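To make the sequence labeling framing concrete, the following is a minimal sketch of how a segmented sentence can be turned into per-character tags and how a predicted segmentation can be scored. The abstract does not state the tagging scheme or the metric used in the paper; the B/M/E/S scheme and word-level precision, recall, and F1 over character spans are assumed here as a common convention, and all function names and the example sentence are hypothetical.

```python
from typing import List, Set, Tuple


def words_to_bmes(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character.

    Assumed scheme: S = single-character word, B/M/E = begin/middle/end
    of a multi-character word.
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars: str, tags: List[str]) -> List[str]:
    """Rebuild words from characters and their (predicted) tags."""
    words: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that does not end in E or S
        words.append(current)
    return words


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Return the (start, end) character offsets of each word."""
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 from matching spans."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["我", "喜欢", "新加坡"]        # hypothetical gold segmentation
    pred = ["我", "喜", "欢", "新加坡"]    # hypothetical tool output
    print(words_to_bmes(gold))
    print(segmentation_prf(gold, pred))
```

Under this framing, a tagger (for example a BERT-based one) predicts one tag per character, `bmes_to_words` recovers the words, and `segmentation_prf` compares them with the gold annotation.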
Acknowledgement
This work was supported by the National Natural Science Foundation of China (No. 61572145), the Major Projects of Guangdong Education Department for Foundation Research and Applied Research (No. 2017KZDXM031), and the National Social Science Foundation of China (No. 17CTQ045). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Jiang, S., Fu, Y., Lin, N. (2022). Construction and Evaluation of Chinese Word Segmentation Datasets in Malay Archipelago. In: Dong, M., Gu, Y., Hong, JF. (eds) Chinese Lexical Semantics. CLSW 2021. Lecture Notes in Computer Science, vol 13250. Springer, Cham. https://doi.org/10.1007/978-3-031-06547-7_14
DOI: https://doi.org/10.1007/978-3-031-06547-7_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06546-0
Online ISBN: 978-3-031-06547-7
eBook Packages: Computer Science (R0)