DOI: 10.5555/1873781.1873871

Using cross-lingual projections to generate semantic role labeled corpus for Urdu: a resource poor language

Published: 23 August 2010

Abstract

In this paper we explore the possibility of using cross-lingual projections to automatically induce role-semantic annotations in the PropBank paradigm for Urdu, a resource-poor language. This technique projects annotations across languages based on word alignments. It is relatively inexpensive and has the potential to reduce the human effort involved in creating semantic role resources. The projection model exploits lexical as well as syntactic information from an English-Urdu parallel corpus. We show that our method generates reasonably good annotations, with an accuracy of 92% on short structured sentences. Using the automatically generated annotated corpus, we conduct preliminary experiments to create a semantic role labeler for Urdu. The results of the labeler, though modest, are promising and indicate the potential of our technique to generate large-scale annotations for Urdu.
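
The general idea of projecting annotations through word alignments can be illustrated with a minimal sketch. This is not the paper's projection model, which also uses lexical and syntactic information; it only shows the simplest form of label transfer. The function name `project_roles`, the `"O"` placeholder for unlabeled tokens, the tie-breaking rule, and the romanized example sentence with its alignment are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def project_roles(
    source_roles: Dict[int, str],
    alignment: List[Tuple[int, int]],
    target_length: int,
) -> List[str]:
    """Copy each labeled source token's role to the target tokens aligned with it.

    source_roles maps source token index -> role label (hypothetical format).
    alignment is a list of (source_index, target_index) pairs.
    Target tokens that receive no label keep the placeholder "O".
    """
    projected = ["O"] * target_length
    for src_idx, tgt_idx in alignment:
        label = source_roles.get(src_idx)
        if label is None or not (0 <= tgt_idx < target_length):
            continue
        # Assumed tie-break: the first label projected onto a target token wins.
        if projected[tgt_idx] == "O":
            projected[tgt_idx] = label
    return projected


# Illustrative example (made up, not taken from the paper's corpus):
# English: "Ali ate an apple"  ->  Ali(0) ate(1) an(2) apple(3)
# Urdu (romanized): "Ali ne seb khaya"  ->  Ali(0) ne(1) seb(2) khaya(3)
english_roles = {0: "ARG0", 1: "PRED", 2: "ARG1", 3: "ARG1"}
word_alignment = [(0, 0), (1, 3), (3, 2)]

print(project_roles(english_roles, word_alignment, target_length=4))
# prints ['ARG0', 'O', 'ARG1', 'PRED']
```

In this toy case the case marker "ne" has no aligned English token and stays unlabeled, which hints at why a projection model would need more than raw alignments to produce complete annotations.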



Published In

COLING '10: Proceedings of the 23rd International Conference on Computational Linguistics
August 2010
1408 pages

Publisher

Association for Computational Linguistics, United States


