Abstract
Machine learning based approaches are the most important among approaches to Chinese word segmentation, and feature selection is crucial to them. This paper surveys the feature sets used in several machine learning based approaches to the Chinese word segmentation closed task. The feature sets are compared on the SIGHAN corpora under the same machine learning framework, the maximum entropy model. Based on this model, two new efficiency measures are presented: the number of unique events and the number of predicates. Experimental results show the impact of the feature sets on effectiveness and efficiency, and on that basis suggestions are given for feature selection when building a machine learning based Chinese word segmentation system.
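To make the two efficiency measures concrete, the following is a minimal sketch of how they might be counted for one character-based feature set. It is not the paper's implementation. The abstract does not define the measures or list the feature templates, so the sketch assumes the usage common in maximum entropy toolkits: an event is a training instance (a set of contextual features paired with an outcome tag), and a predicate is a single contextual feature. The 4-tag B/M/E/S scheme, the five templates, and the function names `tag_sentence`, `predicates_at`, and `count_events_and_predicates` are all illustrative assumptions.

```python
from typing import List, Tuple


def tag_sentence(words: List[str]) -> List[Tuple[str, str]]:
    """Turn a segmented sentence into (character, tag) pairs.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    pairs = []
    for word in words:
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def predicates_at(chars: List[str], i: int) -> Tuple[str, ...]:
    """Instantiate the assumed feature templates at position i.

    Templates (hypothetical): C-1, C0, C1, C-1C0, C0C1.
    """
    def c(offset: int) -> str:
        j = i + offset
        return chars[j] if 0 <= j < len(chars) else "<PAD>"

    return (
        f"C-1={c(-1)}",
        f"C0={c(0)}",
        f"C1={c(1)}",
        f"C-1C0={c(-1)}{c(0)}",
        f"C0C1={c(0)}{c(1)}",
    )


def count_events_and_predicates(corpus: List[List[str]]) -> Tuple[int, int]:
    """Return (unique events, unique predicates) for a segmented corpus."""
    events = set()      # distinct (context, outcome) training instances
    predicates = set()  # distinct contextual features
    for words in corpus:
        pairs = tag_sentence(words)
        chars = [ch for ch, _ in pairs]
        for i, (_, tag) in enumerate(pairs):
            context = predicates_at(chars, i)
            events.add((context, tag))
            predicates.update(context)
    return len(events), len(predicates)


if __name__ == "__main__":
    toy_corpus = [["我", "喜欢", "北京"], ["我", "喜欢", "北京"]]
    print(count_events_and_predicates(toy_corpus))
```

Under these assumptions, a richer feature set instantiates more predicates per character and makes identical events rarer, so both counts grow with the size of the model to be trained. That is one way the counts could serve as a proxy for training cost when comparing feature sets.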
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Liu, X. (2018). Comparisons of Features for Chinese Word Segmentation. In: Yuan, H., Geng, J., Liu, C., Bian, F., Surapunt, T. (eds) Geo-Spatial Knowledge and Intelligence. GSKI 2017. Communications in Computer and Information Science, vol 849. Springer, Singapore. https://doi.org/10.1007/978-981-13-0896-3_49
DOI: https://doi.org/10.1007/978-981-13-0896-3_49
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-0895-6
Online ISBN: 978-981-13-0896-3
eBook Packages: Computer Science (R0)