DOI: 10.3115/991250.991338

A grammatico-statistical approach to discourse partitioning

Published: 05 August 1994

Abstract

The paper presents a new approach to text segmentation, the task of dividing a text into coherent discourse units. The approach builds on the theory of discourse segments (Nomoto and Nitta, 1993) and incorporates ideas from information retrieval research (Salton, 1988). A discourse segment reflects the structure of Japanese discourse: it can be thought of as a linguistic unit demarcated by wa, the Japanese topic particle, and it may extend over several sentences. The segmentation operates on discourse segments and uses a coherence measure based on tf-idf, a standard information retrieval weighting (Salton, 1988; Hearst, 1993). Experiments were carried out on a Japanese newspaper corpus, and the approach proved quite successful in recovering articles from the unstructured corpus.
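
The abstract gives no formulas, so the following is only a minimal sketch of how a tf-idf coherence measure between adjacent discourse segments might look, not the authors' actual method. It assumes the text has already been split into tokenized segments (for example at wa-marked topics), uses cosine similarity as the coherence score, and places a boundary wherever the score falls below a threshold. The function names, the cosine measure, and the threshold value of 0.1 are all illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(segments):
    """Map each tokenized segment to a {term: tf * idf} dictionary.

    Each segment plays the role of a "document" when computing idf.
    """
    n = len(segments)
    # Document frequency: number of segments that contain each term.
    df = Counter(term for seg in segments for term in set(seg))
    vectors = []
    for seg in segments:
        tf = Counter(seg)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def boundaries(segments, threshold=0.1):
    """Return indices i such that a boundary falls just before segment i.

    The threshold is a hypothetical value chosen for illustration only.
    """
    vecs = tfidf_vectors(segments)
    return [i + 1 for i in range(len(vecs) - 1)
            if cosine(vecs[i], vecs[i + 1]) < threshold]


# Toy example: two finance-like segments followed by two sports-like ones.
segments = [
    ["bank", "rate", "loan"],
    ["bank", "loan", "interest"],
    ["team", "goal", "match"],
    ["team", "match", "coach"],
]
print(boundaries(segments))  # → [2]
```

In the toy input the first two segments share vocabulary, as do the last two, so the only low-coherence gap is between segments 1 and 2, and a boundary is reported before segment 2.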

References

[1] Shinji Fujisawa, Shigeru Masuyama, and Shozo Naito. 1993. An Inspection on Effect of Discourse Constraints pertaining to Ellipsis Supplement in Japanese Sentences. In Kouen-Ronbun-Shuu 3 (conference papers 3). Information Processing Society of Japan. In Japanese.
[2] Barbara Grosz and Candace Sidner. 1986. Attention, Intentions and the Structure of Discourse. Computational Linguistics, 12(3):175-204.
[3] Marti A. Hearst. 1993. TextTiling: A Quantitative Approach to Discourse Segmentation. Sequoia 2000 93/24, University of California, Berkeley.
[4] Jerry R. Hobbs. 1979. Coherence and Coreference. Cognitive Science, 3(1):67-90.
[5] Eduard H. Hovy. 1990. Parsimonious and Profligate Approaches to the Question of Discourse Structure Relations. In 5th ACL Workshop on Natural Language Generation, Dawson, Pennsylvania.
[6] Hideki Kozima. 1993. Text Segmentation Based on Similarity Between Words. In Proceedings of the 31st Annual Meeting of the ACL.
[7] W. C. Mann and S. A. Thompson. 1987. Rhetorical Structure Theory. In L. Polanyi, editor, The Structure of Discourse. Ablex Publishing Corp., Norwood, NJ.
[8] Yuji Matsumoto, Sadao Kurohashi, Takehito Utsuro, Yutaka Taeki, and Makoto Nagao. 1993. Japanese Morphological Analysis System JUMAN Manual. Kyoto University. In Japanese.
[9] Akira Mikami. 1960. Zou wa Hana ga Nagai (The elephant has a long trunk). Kuroshio Shuppan, Tokyo.
[10] Tadashi Nomoto and Yoshihiko Nitta. 1993. Resolving Zero Anaphora in Japanese. In ACL Proceedings of Sixth European Conference, pages 315-321, Utrecht, The Netherlands.
[11] Geoffrey Nunberg. 1990. The Linguistics of Punctuation, volume 18 of CSLI Lecture Notes. CSLI.
[12] Rebecca J. Passonneau and Diane J. Litman. 1993. Intention-based Segmentation: Human Reliability and Correlation with Linguistic Cues. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics. The Association for Computational Linguistics. Ohio State University, Columbus, Ohio, USA.
[13] Gerard Salton. 1988. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
[14] Takenobu Tokunaga and Makoto Iwayama. 1994. Text Categorization based on Weighted Inverse Document Frequency. Unpublished manuscript, submitted to ACM SIGIR 1994.
[15] Gilbert Youmans. 1991. A New Tool for Discourse Analysis: The Vocabulary Management Profile. Language, 67:763-789.

Published In

COLING '94: Proceedings of the 15th conference on Computational linguistics - Volume 2
August 1994
661 pages

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 1,537 of 1,537 submissions, 100%

Cited By

  • (2007) Text segmentation based on document understanding for information retrieval. Proceedings of the 12th international conference on Applications of Natural Language to Information Systems, pages 295-304. DOI: 10.5555/2394705.2394740. Online publication date: 27-Jun-2007
  • (2007) A new hybrid summarizer based on vector space model, statistical physics and linguistics. Proceedings of the artificial intelligence 6th Mexican international conference on Advances in artificial intelligence, pages 872-882. DOI: 10.5555/1775967.1776057. Online publication date: 4-Nov-2007
  • (2000) Nine Issues in Speech Translation. Machine Translation, 15:1/2, pages 149-186. DOI: 10.1023/A:1011180928513. Online publication date: 1-Jun-2000
  • (1998) Thematic segmentation of texts. Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 1, pages 392-396. DOI: 10.3115/980845.980912. Online publication date: 10-Aug-1998
  • (1998) How to thematically segment texts by using lexical cohesion? Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics - Volume 2, pages 1481-1483. DOI: 10.3115/980691.980813. Online publication date: 10-Aug-1998
  • (1998) Cut as a querying unit for WWW, Netnews, and E-mail. Proceedings of the ninth ACM conference on Hypertext and hypermedia: links, objects, time and space---structure in hypermedia systems, pages 235-244. DOI: 10.1145/276627.276653. Online publication date: 1-May-1998
  • (1997) TextTiling. Computational Linguistics, 23:1, pages 33-64. DOI: 10.5555/972684.972687. Online publication date: 1-Mar-1997