DOI: 10.5555/2002472.2002639

Using large monolingual and bilingual corpora to improve coordination disambiguation

Published: 19 June 2011

Abstract

Resolving coordination ambiguity is a classic hard problem. This paper looks at coordination disambiguation in complex noun phrases (NPs). Parsers trained on the Penn Treebank are reporting impressive numbers these days, but they don't do very well on this problem (79%). We explore systems trained using three types of corpora: (1) annotated (e.g., the Penn Treebank), (2) bitexts (e.g., Europarl), and (3) unannotated monolingual (e.g., Google N-grams). Size matters: (1) is a million words, (2) is potentially billions of words, and (3) is potentially trillions of words. The unannotated monolingual data is helpful when the ambiguity can be resolved through associations among the lexical items. The bilingual data is helpful when the ambiguity can be resolved by the order of words in the translation. We train separate classifiers with monolingual and bilingual features and iteratively improve them via co-training. The co-trained classifier achieves close to 96% accuracy on Treebank data and makes 20% fewer errors than a supervised system trained with Treebank annotations.
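
The co-training loop mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the classifiers, features, or selection rule, so the toy nearest-centroid classifier, the one-dimensional feature vectors, and the names `CentroidClassifier`, `co_train`, `rounds`, and `per_round` are all assumptions made for illustration. The only idea taken from the abstract is that two classifiers, one per feature view (monolingual and bilingual), take turns labeling unlabeled examples for a shared training pool.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Labeled = Tuple[Vector, Vector, int]   # (monolingual view, bilingual view, label)
Unlabeled = Tuple[Vector, Vector]      # (monolingual view, bilingual view)


class CentroidClassifier:
    """Toy binary classifier (an assumption, not the paper's model):
    stores one centroid per class and predicts the nearer one."""

    def fit(self, xs: List[Vector], ys: List[int]) -> None:
        self.centroids: Dict[int, Vector] = {}
        for label in (0, 1):
            rows = [x for x, y in zip(xs, ys) if y == label]
            dim = len(rows[0])  # assumes both classes appear in the labeled data
            self.centroids[label] = [
                sum(r[i] for r in rows) / len(rows) for i in range(dim)
            ]

    def predict_with_confidence(self, x: Vector) -> Tuple[int, float]:
        # Squared distance to each class centroid; the gap acts as confidence.
        dist = {
            lab: sum((a - b) ** 2 for a, b in zip(x, c))
            for lab, c in self.centroids.items()
        }
        label = 0 if dist[0] <= dist[1] else 1
        return label, abs(dist[0] - dist[1])


def co_train(labeled: List[Labeled], unlabeled: List[Unlabeled],
             rounds: int = 5, per_round: int = 1):
    """Each round, each view's classifier labels its most confident
    unlabeled examples and adds them to the shared labeled pool."""
    labeled, pool = list(labeled), list(unlabeled)
    clf_mono, clf_bi = CentroidClassifier(), CentroidClassifier()

    def refit() -> None:
        ys = [y for _, _, y in labeled]
        clf_mono.fit([a for a, _, _ in labeled], ys)
        clf_bi.fit([b for _, b, _ in labeled], ys)

    for _ in range(rounds):
        if not pool:
            break
        refit()
        for clf, view in ((clf_mono, 0), (clf_bi, 1)):
            ranked = sorted(
                pool,
                key=lambda ex: clf.predict_with_confidence(ex[view])[1],
                reverse=True,
            )
            for ex in ranked[:per_round]:
                label, _ = clf.predict_with_confidence(ex[view])
                labeled.append((ex[0], ex[1], label))
                pool.remove(ex)
    refit()  # final fit on everything labeled so far
    return clf_mono, clf_bi


# Hypothetical toy data: one seed example per class plus four unlabeled ones.
seed = [([0.0], [0.0], 0), ([1.0], [1.0], 1)]
raw = [([0.1], [0.2]), ([0.9], [0.8]), ([0.2], [0.1]), ([0.8], [0.9])]
clf_mono, clf_bi = co_train(seed, raw)
print(clf_mono.predict_with_confidence([0.3])[0])
```

In this sketch each view can supply labels that the other view then trains on, which is the mechanism that lets unlabeled data help; a real system would substitute proper classifiers and the monolingual and bilingual features described in the paper.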

Published In

HLT '11: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1
June 2011
1696 pages
ISBN: 9781932432879

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 240 of 768 submissions, 31%

Cited By

  • (2013) How many multiword expressions do people know? ACM Transactions on Speech and Language Processing 10:2 (1-13). DOI: 10.1145/2483691.2483693. Online publication date: 21-Jun-2013
  • (2012) Attacking parsing bottlenecks with unlabeled data and relevant factorizations. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1 (768-776). DOI: 10.5555/2390524.2390633. Online publication date: 8-Jul-2012
  • (2012) Discovering factions in the computational linguistics community. Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries (22-32). DOI: 10.5555/2390507.2390511. Online publication date: 10-Jul-2012
  • (2011) How many multiword expressions do people know? Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (137-144). DOI: 10.5555/2021121.2021152. Online publication date: 23-Jun-2011
