Cross-lingual genre classification

Published: 26 April 2012

Abstract

Classifying text genres across languages can bring the benefits of genre classification to the target language without the costs of manual annotation. This article introduces the first approach to this task, which exploits text features that can be considered stable genre predictors across languages. My experiments show that this method performs as well as or better than full-text translation combined with monolingual classification, while requiring fewer resources.

Published In

EACL '12: Proceedings of the Student Research Workshop at the 13th Conference of the European Chapter of the Association for Computational Linguistics
April 2012
99 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Research-article

Acceptance Rates

Overall Acceptance Rate 100 of 360 submissions, 28%
