A phonetic similarity model for automatic extraction of transliteration pairs

Published: 01 September 2007

Abstract

This article proposes an approach for the automatic extraction of transliteration pairs from Chinese Web corpora. In this approach, we formulate the machine transliteration process using a syllable-based phonetic similarity model which consists of phonetic confusion matrices and a Chinese character n-gram language model. With the phonetic similarity model, the extraction of transliteration pairs becomes a two-step process of recognition followed by validation: First, in the recognition process, we identify the most probable transliteration in the k-neighborhood of a recognized English word. Then, in the validation process, we qualify the transliteration pair candidates with a hypothesis test. We carry out an analytical study on the statistics of several key factors in English-Chinese transliteration to help formulate phonetic similarity modeling. We then conduct both supervised and unsupervised learning of a phonetic similarity model on a development database. The experimental results validate the effectiveness of the phonetic similarity model by achieving an F-measure of 0.739 in supervised learning. The unsupervised learning approach works almost as well as the supervised one, thus allowing us to deploy automatic extraction of transliteration pairs in the Web space.
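The recognition-then-validation pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The confusion probabilities, bigram probabilities, syllable units, smoothing floor, and acceptance threshold are all invented for illustration. The sketch assumes a one-to-one alignment between English syllables and Chinese characters, and a fixed score threshold stands in for the hypothesis test. The `f_measure` helper uses the standard definition of the balanced F-measure.

```python
import math

# Hypothetical P(Chinese character | English syllable); all values are invented.
CONFUSION = {
    ("ko", "可"): 0.7, ("ko", "口"): 0.2,
    ("ka", "口"): 0.5,
    ("la", "乐"): 0.6, ("la", "拉"): 0.3,
}
# Hypothetical Chinese character bigram probabilities P(current | previous).
BIGRAM = {("可", "口"): 0.3, ("口", "可"): 0.2, ("可", "乐"): 0.3}
FLOOR = 1e-4  # assumed smoothing value for unseen pairs


def score(syllables, chars):
    """Length-normalized log score of a candidate under the toy model."""
    if len(syllables) != len(chars):
        return -math.inf  # the sketch only handles one-to-one alignment
    total, prev = 0.0, None
    for syl, ch in zip(syllables, chars):
        total += math.log(CONFUSION.get((syl, ch), FLOOR))
        if prev is not None:
            total += math.log(BIGRAM.get((prev, ch), FLOOR))
        prev = ch
    return total / len(syllables)


def recognize(syllables, context, k):
    """Recognition: best-scoring substring among the k characters
    nearest the English word (here, the last k characters before it)."""
    window = context[-k:]
    n = len(syllables)
    best = (None, -math.inf)
    for i in range(len(window) - n + 1):
        cand = window[i:i + n]
        s = score(syllables, cand)
        if s > best[1]:
            best = (cand, s)
    return best


def validate(s, threshold=-3.0):
    """Validation: accept the pair if its score clears an assumed threshold.
    A simplified stand-in for the hypothesis test."""
    return s >= threshold


def f_measure(tp, fp, fn):
    """Balanced F-measure from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    syllables = ["ko", "ka", "ko", "la"]  # hypothetical syllabification of "Coca-Cola"
    cand, s = recognize(syllables, "我喜欢喝可口可乐", k=6)
    print(cand, round(s, 3), validate(s))
```

In this toy setup the candidate closest to the English word wins because every other window falls back to the smoothing floor. A real system would learn the confusion matrices and the language model from data, as the abstract describes for both the supervised and unsupervised settings.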

Published In

ACM Transactions on Asian Language Information Processing, Volume 6, Issue 2
September 2007
84 pages
ISSN:1530-0226
EISSN:1558-3430
DOI:10.1145/1282080

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Machine translation
  2. extraction of transliteration pairs
  3. machine transliteration
  4. phonetic confusion probability
  5. phonetic similarity modeling

Qualifiers

  • Article

Article Metrics

  • Downloads (Last 12 months): 8
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 09 Jan 2025

Cited By

  • (2018) Machine transliteration and transliterated text retrieval: a survey. Sādhanā 43:6. DOI: 10.1007/s12046-018-0828-8. Online publication date: 7-Jun-2018
  • (2013) Sediment traps from synthetic construction site stormwater runoff by grassed filter strip. Journal of Hydrology 502, 53-61. DOI: 10.1016/j.jhydrol.2013.08.019. Online publication date: Oct-2013
  • (2012) Transliteration mining using large training and test sets. Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 243-252. DOI: 10.5555/2382029.2382061. Online publication date: 3-Jun-2012
  • (2011) Improved transliteration mining using graph reinforcement. Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1384-1393. DOI: 10.5555/2145432.2145578. Online publication date: 27-Jul-2011
  • (2011) Mining named entities with temporally correlated bursts from multilingual web news streams. Proceedings of the fourth ACM international conference on Web search and data mining, 237-246. DOI: 10.1145/1935826.1935870. Online publication date: 9-Feb-2011
  • (2011) Machine transliteration survey. ACM Computing Surveys 43:3, 1-46. DOI: 10.1145/1922649.1922654. Online publication date: 29-Apr-2011
  • (2010) Improving name origin recognition with context features and unlabelled data. Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 972-978. DOI: 10.5555/1944566.1944678. Online publication date: 23-Aug-2010
  • (2010) Transliteration mining with phonetic conflation and iterative training. Proceedings of the 2010 Named Entities Workshop, 53-56. DOI: 10.5555/1870457.1870464. Online publication date: 16-Jul-2010
  • (2010) Mining Synonymous Transliterations from the World Wide Web. ACM Transactions on Asian Language Information Processing (TALIP) 9:1, 1-28. DOI: 10.1145/1731035.1731036. Online publication date: 1-Mar-2010
  • (2010) A novel approach for proper name transliteration verification. 2010 7th International Symposium on Chinese Spoken Language Processing, 89-94. DOI: 10.1109/ISCSLP.2010.5684842. Online publication date: Nov-2010
