Abstract
Overlapping ambiguity is a major type of ambiguity in Chinese word segmentation. This paper studies the statistical properties of overlapping ambiguities in depth, based on observations from a very large, balanced, general-purpose Chinese corpus, and reports the relevant statistics from several perspectives. The stability of high-frequency maximal overlapping ambiguities is tested using statistical observations from both the general-purpose corpus and domain-specific corpora. On this basis, a disambiguation strategy for overlapping ambiguities is proposed that assigns a predefined solution to each of 5,507 pseudo overlapping ambiguities, suggesting that over 42% of the overlapping ambiguities in Chinese running text could be resolved without making any error. Several state-of-the-art word segmenters are compared on these overlapping ambiguities. Preliminary experiments show that about 2% of the 5,507 pseudo ambiguities that are mistakenly segmented by these segmenters can be properly treated by the proposed strategy.
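The table-lookup idea behind the proposed strategy can be illustrated with a minimal sketch. It assumes the usual definition of an overlapping ambiguity (a string ABC in which both AB and BC are dictionary words) and that a pseudo ambiguity is one whose segmentation is effectively fixed regardless of context. The function names, the toy lexicon, and the table entry below are hypothetical and are not taken from the paper; the sketch also detects only pairwise overlaps, not the maximal overlapping ambiguities the paper studies.

```python
# Toy illustration only: the lexicon and the lookup-table entry are made up.

def find_overlaps(text, lexicon, max_len=4):
    """Return (start, end) spans covered by two overlapping lexicon words."""
    # Collect every multi-character lexicon word that occurs in the text.
    words = [(i, j)
             for i in range(len(text))
             for j in range(i + 2, min(len(text), i + max_len) + 1)
             if text[i:j] in lexicon]
    spans = []
    for i1, j1 in words:
        for i2, j2 in words:
            # The second word starts inside the first and ends after it.
            if i1 < i2 < j1 < j2:
                spans.append((i1, j2))
    return spans


def resolve(text, lexicon, table):
    """Map each overlapping span to its predefined segmentation, if any.

    Spans missing from the table map to None and would be left to a
    general-purpose segmenter.
    """
    return {(i, j): table.get(text[i:j])
            for i, j in find_overlaps(text, lexicon)}


lexicon = {"AB", "BC"}            # hypothetical dictionary words
table = {"ABC": ["AB", "C"]}      # hypothetical predefined solution
print(resolve("xABCy", lexicon, table))  # {(1, 4): ['AB', 'C']}
```

Because each table entry carries a fixed answer, resolution is a constant-time lookup per detected span, and anything outside the table is deferred rather than guessed.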
The research is supported by the National Natural Science Foundation of China under grant number 60573187 and the CINACS project.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Qiao, W., Sun, M., Menzel, W. (2008). Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2008. Lecture Notes in Computer Science, vol 5246. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87391-4_24
DOI: https://doi.org/10.1007/978-3-540-87391-4_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87390-7
Online ISBN: 978-3-540-87391-4
eBook Packages: Computer Science, Computer Science (R0)