Abstract
Most ancient Chinese texts have no punctuation or sentence segmentation. Recent research on automatic sentence segmentation of ancient Chinese has usually relied on sequence labelling models and small data sets. In this paper, we propose a sentence segmentation method for ancient Chinese texts based on neural network language models. Experiments on large-scale corpora indicate that our method is effective and achieves results comparable to those of the traditional CRF model. Applying a sentence length penalty, using larger Simplified Chinese corpora, or dividing the corpora by historical period can further improve the performance of our model.
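The abstract does not spell out how the language model is used to place sentence breaks, so the following is only a minimal sketch of one generic way to do it, not the authors' implementation. It assumes a boundary symbol is treated as an ordinary token, uses a toy add-one-smoothed character bigram model as a stand-in for a neural network language model, and searches over break placements with a small beam. The names `BOUNDARY`, `train_bigram`, `segment`, `length_penalty` and `soft_max_len`, as well as the linear form of the length penalty, are hypothetical choices made for illustration.

```python
import heapq
import math
from collections import Counter

BOUNDARY = "|"  # hypothetical sentence-break symbol, not taken from the paper


def train_bigram(segmented_lines):
    """Fit an add-one-smoothed character bigram model (toy stand-in for an NNLM).

    Each training line is a string in which BOUNDARY marks the end of a sentence.
    """
    bigrams, contexts, vocab = Counter(), Counter(), {BOUNDARY}
    for line in segmented_lines:
        tokens = [BOUNDARY] + list(line)
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            bigrams[(prev, nxt)] += 1
            contexts[prev] += 1

    def logp(prev, nxt):
        # Smoothed log-probability of nxt given prev; unseen pairs get a small mass.
        return math.log((bigrams[(prev, nxt)] + 1) / (contexts[prev] + len(vocab)))

    return logp


def segment(text, logp, beam_size=8, length_penalty=0.0, soft_max_len=10):
    """Beam search over break placements, scored by the language model.

    The length penalty here is an assumed linear cost on every character
    beyond soft_max_len within one sentence; the paper's form may differ.
    """
    # Each hypothesis: (score, output tokens, previous token, current sentence length)
    beams = [(0.0, [], BOUNDARY, 0)]
    for ch in text:
        candidates = []
        for score, out, prev, cur_len in beams:
            # Option 1: extend the current sentence with ch.
            over = max(0, cur_len + 1 - soft_max_len)
            cont = score + logp(prev, ch) - length_penalty * over
            candidates.append((cont, out + [ch], ch, cur_len + 1))
            # Option 2: close the current (non-empty) sentence before ch.
            if cur_len > 0:
                brk = score + logp(prev, BOUNDARY) + logp(BOUNDARY, ch)
                candidates.append((brk, out + [BOUNDARY, ch], ch, 1))
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    # Every hypothesis must also close its final sentence.
    best = max(beams, key=lambda c: c[0] + logp(c[2], BOUNDARY))
    return "".join(best[1]).split(BOUNDARY)
```

With the toy training data `["ab|ab|ab|"]`, calling `segment("abab", train_bigram(["ab|ab|ab|"]))` returns `["ab", "ab"]`. A neural model could be swapped in by replacing `logp` with a function that conditions on the full left context instead of only the previous character.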
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Wang, B., Shi, X., Tan, Z., Chen, Y., Wang, W. (2016). A Sentence Segmentation Method for Ancient Chinese Texts Based on NNLM. In: Dong, M., Lin, J., Tang, X. (eds) Chinese Lexical Semantics. CLSW 2016. Lecture Notes in Computer Science, vol 10085. Springer, Cham. https://doi.org/10.1007/978-3-319-49508-8_36
DOI: https://doi.org/10.1007/978-3-319-49508-8_36
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49507-1
Online ISBN: 978-3-319-49508-8
eBook Packages: Computer Science (R0)