Integrating Multi-source Bilingual Information for Chinese Word Segmentation in Statistical Machine Translation

Wei Chen, Wei Wei, Zhenbiao Chen & Bo Xu

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8202)

Abstract

Chinese text is written without spaces between words, which is problematic for Chinese-English statistical machine translation (SMT). The most widely used approach in existing SMT systems is to apply a fixed segmentation produced by an off-the-shelf Chinese word segmentation (CWS) system and train the standard translation model on it. Such an approach is sub-optimal for SMT. We propose a joint model that integrates multi-source bilingual information to optimize the segmentation for SMT. We also propose an unsupervised algorithm that iteratively improves the quality of the joint model. Experiments show that our method improves both segmentation and translation performance in different data environments.
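
The segmentation ambiguity that motivates this work can be illustrated with a small sketch. It is not the joint model proposed in the paper: the lexicon, the example sentence, and the function name `enumerate_segmentations` are all assumptions made for illustration. The sketch only shows that one unspaced string admits several dictionary-consistent segmentations, so a single fixed segmentation chosen in advance need not be the one that suits translation best.

```python
from typing import List, Set


def enumerate_segmentations(text: str, lexicon: Set[str], max_len: int = 4) -> List[List[str]]:
    """Return every split of `text` into lexicon words.

    Single characters are always accepted as a fallback, so every
    input has at least one segmentation.
    """
    if not text:
        return [[]]
    results = []
    for end in range(1, min(max_len, len(text)) + 1):
        word = text[:end]
        # Accept the prefix if it is a known word or a single character.
        if end == 1 or word in lexicon:
            for rest in enumerate_segmentations(text[end:], lexicon, max_len):
                results.append([word] + rest)
    return results


# Hypothetical toy lexicon, not taken from the paper.
toy_lexicon = {"研究", "研究生", "生命", "起源"}
candidates = enumerate_segmentations("研究生命起源", toy_lexicon)

for seg in candidates:
    print(" ".join(seg))
```

Among the candidates are both 研究 生命 起源 ("study the origin of life") and 研究生 命 起源 (which starts with "graduate student"). A monolingual segmenter has to commit to one of them before translation, whereas bilingual information can help decide which split aligns better with the English side.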

Author information

Authors and Affiliations

Interactive Digital Media Technology Research Center (IDMTech), Institute of Automation, Chinese Academy of Sciences, China
Wei Chen, Wei Wei, Zhenbiao Chen & Bo Xu

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Maosong Sun
Horizon Doctoral Training Centre, School of Computer Science, University of Nottingham, NG8 1BB, Nottingham, UK
Min Zhang
Google Inc., Mountain View, CA, USA
Dekang Lin
Baidu Inc., Beijing, China
Haifeng Wang

About this paper

Cite this paper

Chen, W., Wei, W., Chen, Z., Xu, B. (2013). Integrating Multi-source Bilingual Information for Chinese Word Segmentation in Statistical Machine Translation. In: Sun, M., Zhang, M., Lin, D., Wang, H. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2013, CCL 2013. Lecture Notes in Computer Science (LNAI), vol 8202. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41491-6_7

DOI: https://doi.org/10.1007/978-3-642-41491-6_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41490-9
Online ISBN: 978-3-642-41491-6
eBook Packages: Computer Science, Computer Science (R0)
