Using large monolingual and bilingual corpora to improve coordination disambiguation

Authors:

David Yarowsky,

Kenneth Church

HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1

Pages 1346 - 1355

Published: 19 June 2011

Abstract

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don't do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g. the Penn Treebank), (2) bitexts (e.g. Europarl), and (3) unannotated monolingual (e.g. Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words, and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.
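The co-training step mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not the authors' implementation: the nearest-centroid classifier, the confidence measure, the shared labeled pool, and all names (`CentroidClassifier`, `fit_views`, `co_train`, `rounds`, `per_round`) are assumptions made for illustration. Each example has two feature views, standing in for the monolingual and bilingual features. In each round, each view's classifier labels the unlabeled examples it is most confident about, and those examples join the labeled set used to retrain both classifiers.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Labeled = Tuple[Vector, Vector, int]  # (monolingual view, bilingual view, label)
Unlabeled = Tuple[Vector, Vector]     # (monolingual view, bilingual view)


class CentroidClassifier:
    """Toy binary classifier (an assumption): compares distances to class means."""

    def fit(self, X: List[Vector], y: List[int]) -> "CentroidClassifier":
        # Requires at least one training example for each of the labels 0 and 1.
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def score(self, x: Vector) -> float:
        """Positive when class 1 is closer; the magnitude serves as confidence."""
        d0 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[0]))
        d1 = sum((a - b) ** 2 for a, b in zip(x, self.centroids[1]))
        return d0 - d1

    def predict(self, x: Vector) -> int:
        return 1 if self.score(x) > 0 else 0


def fit_views(labeled: List[Labeled]) -> Tuple[CentroidClassifier, CentroidClassifier]:
    """Train one classifier per view on the same labeled examples."""
    labels = [y for _, _, y in labeled]
    clf_a = CentroidClassifier().fit([a for a, _, _ in labeled], labels)
    clf_b = CentroidClassifier().fit([b for _, b, _ in labeled], labels)
    return clf_a, clf_b


def co_train(labeled: List[Labeled], unlabeled: List[Unlabeled],
             rounds: int = 5, per_round: int = 1):
    """Grow the labeled set with each view's most confident predictions."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        classifiers = fit_views(labeled)
        for view, clf in enumerate(classifiers):
            # Rank the remaining pool by this view's confidence.
            pool.sort(key=lambda ex: abs(clf.score(ex[view])), reverse=True)
            confident, pool = pool[:per_round], pool[per_round:]
            # Label the top examples with this view's own predictions.
            labeled.extend((a, b, clf.predict((a, b)[view])) for a, b in confident)
    return fit_views(labeled)


# Toy data (invented): one feature per view, two seed examples, four unlabeled.
seed = [((0.0,), (0.0,), 0), ((1.0,), (1.0,), 1)]
pool = [((0.1,), (0.5,)), ((0.9,), (0.5,)), ((0.5,), (0.05,)), ((0.5,), (0.95,))]
mono_clf, bi_clf = co_train(seed, pool)
print(mono_clf.predict((0.1,)), bi_clf.predict((0.9,)))
```

In the toy data, each unlabeled example is informative in only one view, so an example that one classifier labels confidently becomes training data for the other. That is the situation in which co-training is expected to help.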

Index Terms

Using large monolingual and bilingual corpora to improve coordination disambiguation
1. Applied computing
  1. Arts and humanities
    1. Language translation
2. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
      2. Machine translation

Information & Contributors

Information

Published In

HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1

June 2011

1696 pages

ISBN: 9781932432879

General Chair:
Dekang Lin
Google

Publisher

Association for Computational Linguistics

United States

Qualifiers

Research-article

Acceptance Rates

Overall Acceptance Rate 240 of 768 submissions, 31%

Cited By

Church K (2013). How many multiword expressions do people know? ACM Transactions on Speech and Language Processing 10:2 (1-13). Online publication date: 21-Jun-2013.
https://dl.acm.org/doi/10.1145/2483691.2483693

Pitler E, Li H, Lin C, Osborne M (2012). Attacking parsing bottlenecks with unlabeled data and relevant factorizations. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (768-776). Online publication date: 8-Jul-2012.
https://dl.acm.org/doi/10.5555/2390524.2390633

Sim Y, Smith N, Smith D (2012). Discovering factions in the computational linguistics community. Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries (22-32). Online publication date: 10-Jul-2012.
https://dl.acm.org/doi/10.5555/2390507.2390511

Church K, Kordoni V, Ramisch C, Villavicencio A (2011). How many multiword expressions do people know? Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (137-144). Online publication date: 23-Jun-2011.
https://dl.acm.org/doi/10.5555/2021121.2021152
