
Construction and Evaluation of Chinese Word Segmentation Datasets in Malay Archipelago

  • Conference paper
Chinese Lexical Semantics (CLSW 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13250)


Abstract

In recent years, Chinese word segmentation (CWS) has been studied extensively and the research is mature. However, existing work focuses mainly on mainland Mandarin, and CWS for the Chinese used in other countries and regions remains far from satisfactory. Although the Chinese used in Malay Archipelago countries and the Mandarin of mainland China share a common origin, the two have diverged in the course of their separate development. As a result, common CWS tools cannot segment Chinese texts from different countries accurately and effectively. This paper studies Chinese texts from five countries in the Malay Archipelago (Indonesia, Malaysia, Brunei, Singapore, and the Philippines), builds one CWS dataset for each country, and evaluates several advanced word segmentation tools and sequence labeling models on the constructed datasets. The experimental results show the effectiveness of the BERT (Bidirectional Encoder Representations from Transformers) model on the CWS task and provide a baseline for CWS in the five Malay Archipelago countries. Furthermore, we explore whether two training strategies enhance CWS, and the experimental results show that neither strategy significantly improves CWS performance on the Malay Archipelago datasets. Finally, because CWS performance differs across countries, we analyze in depth the objective and historical reasons for the differences, which mainly come down to corpus size, Chinese language policy, and language education norms in the different Malay Archipelago countries.
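To make the sequence labeling framing concrete, here is a minimal sketch of how a segmented sentence can be turned into per-character tags and how word-level precision, recall, and F1 can be computed from them. The B/M/E/S tagging scheme, the function names, and the example sentence are assumptions for illustration only; the abstract does not state which tagging scheme or scoring code the authors used.

```python
from typing import List, Set, Tuple


def words_to_bmes(words: List[str]) -> List[str]:
    """Map a segmented sentence to one B/M/E/S tag per character.

    B = begin, M = middle, E = end of a multi-character word,
    S = single-character word. (Assumed scheme, not from the paper.)
    """
    tags: List[str] = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_spans(tags: List[str]) -> Set[Tuple[int, int]]:
    """Recover word spans as (start, end-exclusive) character offsets.

    A word is closed at an E or S tag, before a following B or S tag,
    or at the end of the sentence, so malformed sequences still decode.
    """
    spans: Set[Tuple[int, int]] = set()
    start = 0
    for i, tag in enumerate(tags):
        is_last = i == len(tags) - 1
        next_starts_word = not is_last and tags[i + 1] in ("B", "S")
        if tag in ("E", "S") or is_last or next_starts_word:
            spans.add((start, i + 1))
            start = i + 1
    return spans


def word_prf(gold_words: List[str], pred_words: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 for one sentence.

    A predicted word counts as correct only if both of its boundaries
    match a gold word exactly.
    """
    gold = bmes_to_spans(words_to_bmes(gold_words))
    pred = bmes_to_spans(words_to_bmes(pred_words))
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the prediction wrongly splits one two-character word.
gold_sentence = ["我", "喜欢", "吃", "榴莲"]
pred_sentence = ["我", "喜", "欢", "吃", "榴莲"]
gold_tags = words_to_bmes(gold_sentence)
precision, recall, f1 = word_prf(gold_sentence, pred_sentence)
print(gold_tags, precision, recall, round(f1, 4))
```

Scoring on exact word spans rather than on individual tags means a single misplaced boundary penalizes both neighbouring words, which is the usual convention for reporting CWS results.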


Notes

  1. https://github.com/fxsjy/jieba

  2. http://qwgzyj.gqb.gov.cn/hwjy/127/382.shtml

  3. https://github.com/chakki-works/seqeval


Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 61572145), the Major Projects of Guangdong Education Department for Foundation Research and Applied Research (No. 2017KZDXM031), and the National Social Science Foundation of China (No. 17CTQ045). The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Author information


Corresponding author

Correspondence to Nankai Lin.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Jiang, S., Fu, Y., Lin, N. (2022). Construction and Evaluation of Chinese Word Segmentation Datasets in Malay Archipelago. In: Dong, M., Gu, Y., Hong, JF. (eds) Chinese Lexical Semantics. CLSW 2021. Lecture Notes in Computer Science, vol 13250. Springer, Cham. https://doi.org/10.1007/978-3-031-06547-7_14


  • DOI: https://doi.org/10.1007/978-3-031-06547-7_14


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-06546-0

  • Online ISBN: 978-3-031-06547-7

  • eBook Packages: Computer Science, Computer Science (R0)
