Abstract
In statistical machine translation, training data typically come from diverse sources and cover multiple themes and genres, and they often do not match the domain of the target text to be translated, which gives rise to the domain adaptation problem. Existing adaptation methods for statistical machine translation are oriented toward the target text and focus on selecting training data and adjusting translation models. These approaches do not assign explicit domain labels to texts or data. This study assigns explicit domain labels and illustrates the approach with two examples of specific context knowledge: (1) domain knowledge based on the Chinese Thesaurus is applied to assign Chinese Library Classification numbers as domain labels to Chinese texts; (2) two-dimensional lexicalized domain knowledge, such as Semantic Category and Application Scenario, is used to label Japanese sentences. Based on the domain labels obtained for the development and test data, the training data can be filtered to achieve domain consistency. Experiments show that training on only a part of the training data can achieve translation performance comparable to training on the whole training data, which indicates that the method is efficient and feasible.
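The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy thesaurus, its labels (`LING`, `ENG`), and the function names `label_domain` and `filter_training_data` are all hypothetical, and whitespace tokenization stands in for real Chinese or Japanese word segmentation. The sketch assumes each sentence receives the most frequent domain label among its matched thesaurus terms, and that a training pair is kept only if its source-side label also occurs in the development or test data.

```python
from collections import Counter

# Hypothetical toy thesaurus mapping terms to domain labels. The entries and
# the labels are invented for illustration; a real system would map thesaurus
# terms to classification numbers.
TOY_THESAURUS = {
    "translation": "LING",
    "corpus": "LING",
    "engine": "ENG",
    "alloy": "ENG",
}


def label_domain(sentence, thesaurus=TOY_THESAURUS):
    """Return the most frequent domain label among matched terms, or None."""
    tokens = sentence.lower().split()  # assumption: whitespace tokenization
    counts = Counter(thesaurus[tok] for tok in tokens if tok in thesaurus)
    return counts.most_common(1)[0][0] if counts else None


def filter_training_data(pairs, target_sentences, thesaurus=TOY_THESAURUS):
    """Keep training pairs whose source label occurs in the dev/test data."""
    target_labels = {label_domain(s, thesaurus) for s in target_sentences}
    target_labels.discard(None)  # unlabeled dev/test sentences select nothing
    return [
        (src, tgt)
        for src, tgt in pairs
        if label_domain(src, thesaurus) in target_labels
    ]


training_pairs = [
    ("the translation corpus is large", "target sentence 1"),
    ("the engine uses an alloy", "target sentence 2"),
    ("hello world", "target sentence 3"),
]
dev_and_test = ["a parallel corpus for translation"]
selected = filter_training_data(training_pairs, dev_and_test)
```

In this sketch, training pairs without any matched term receive no label and are dropped; whether unlabeled data should instead be kept is a design choice the abstract does not specify.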
Acknowledgments
This research was partially supported by the National Natural Science Foundation of China (61303152, 71503240) and ISTIC Research Foundation Projects (ZD2016-05).
Copyright information
© 2016 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
He, Y., Ding, L., Li, Y. (2016). Research on Domain Adaptation for SMT Based on Specific Domain Knowledge. In: Yang, M., Liu, S. (eds) Machine Translation. CWMT 2016. Communications in Computer and Information Science, vol 668. Springer, Singapore. https://doi.org/10.1007/978-981-10-3635-4_5
DOI: https://doi.org/10.1007/978-981-10-3635-4_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-3634-7
Online ISBN: 978-981-10-3635-4
eBook Packages: Computer Science, Computer Science (R0)