Abstract
Chinese text is written without spaces between words, which is problematic for Chinese-English statistical machine translation (SMT). The most widely used approach in existing SMT systems is to apply a fixed segmentation produced by an off-the-shelf Chinese word segmentation (CWS) system when training the standard translation model. Such an approach is sub-optimal and ill-suited to SMT systems. We propose a joint model that integrates multi-source bilingual information to optimize the segmentation for SMT. We also propose an unsupervised algorithm that iteratively improves the quality of the joint model. Experiments show that our method improves both segmentation and translation performance in different data environments.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, W., Wei, W., Chen, Z., Xu, B. (2013). Integrating Multi-source Bilingual Information for Chinese Word Segmentation in Statistical Machine Translation. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD CCL 2013. Lecture Notes in Computer Science, vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_7
DOI: https://doi.org/10.1007/978-3-642-41491-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)