DOI: 10.3115/974358.974374

Exploiting sophisticated representations for document retrieval

Published: 13 October 1994

Abstract

The use of NLP techniques for document classification has not produced significant improvements in performance within the standard term-weighting statistical assignment paradigm (Fagan, 1987; Lewis, 1992b, 1992c; Buckley, 1993). This perplexing fact needs both an explanation and a solution if the power of recently developed NLP techniques is to be successfully applied in IR. We present a novel method for adding linguistic annotation to corpora. It uses a statistical POS tagger in conjunction with unsupervised structure-finding methods to derive notions of "noun group", "verb group", and so on; it is inherently extensible to more sophisticated annotation and does not require a pre-tagged corpus to fit. One distinguishing feature of a linguistically sophisticated representation of documents, compared with a word-set representation, is that linguistically sophisticated units are more often individually good predictors of document descriptors (keywords) than single words are. This leads us to consider assigning descriptors from individual phrases rather than from the weighted sum of a word-set representation. We investigate how sets of individually high-precision rules can yield low precision when used together, and develop some theory about these probably-correct rules. We then repeat results showing that standard statistical models are not particularly suitable for exploiting linguistically sophisticated representations, and show that a statistically fitted rule-based model provides significantly improved performance for sophisticated representations. The work therefore shows that statistical systems can exploit sophisticated representations of documents, and lends some support to the use of more linguistically sophisticated representations for document classification. This paper reports on work done for the LRE project SISTA, which is creating a PC-based tool for use in the technical abstracting industry.
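
One way to see how individually reliable rules can become unreliable in combination is a standard union-bound argument. The sketch below is an illustration only and is not taken from the paper: the function names and the example precision of 0.9 are assumptions. It computes how likely it is that every descriptor assigned to a document is correct when several rules fire on that document at once, first as a worst-case lower bound with no independence assumption, then under an assumed independence of rule errors.

```python
from math import prod


def all_correct_lower_bound(precisions):
    """Worst-case probability that every fired rule is correct.

    Uses the union bound: P(all correct) >= 1 - sum of the individual
    error probabilities. No independence between rules is assumed.
    """
    total_error = sum(1.0 - p for p in precisions)
    return max(0.0, 1.0 - total_error)


def all_correct_if_independent(precisions):
    """Probability that every fired rule is correct, assuming (hypothetically)
    that the rules make their errors independently of one another."""
    return prod(precisions)


if __name__ == "__main__":
    # Hypothetical example: n rules fire on one document, each 0.9 precise.
    for n in (1, 5, 10):
        fired = [0.9] * n
        print(
            n,
            round(all_correct_lower_bound(fired), 3),
            round(all_correct_if_independent(fired), 3),
        )
```

With five rules at 0.9 each, the worst-case bound is about 0.5 and the independent case about 0.59; with ten such rules the worst-case bound falls to zero. The individual uncertainties add up, so the reliability of a set of probably-correct rules depends on how many of them fire together and not only on each rule's own precision.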

References

[1] Adams, E. (1975) The Logic of Conditionals: An Application of Probability to Deductive Logic. Reidel.
[2] Apte, C., F. Damerau & S. Weiss (1994) Towards Language Independent Automated Learning of Text Categorization Methods. In Proceedings of the Seventeenth ACM-SIGIR Conference on Information Retrieval, pp. 23--30. DCU, Dublin.
[3] Buckley, C. (1993) The Importance of Proper Weighting Methods. ARPA Workshop on Human Language Technology.
[4] Church, K. (1988) A stochastic parts program and noun phrase parser for unrestricted text. In Second Conference on Applied NLP, pp. 136--43.
[5] Church, K., W. Gale, P. Hanks & D. Hindle (1989) Parsing, Word Associations and Typical Predicate-Argument Relations. In International Parsing Technologies Workshop. CMU, Pittsburgh.
[6] Fagan, J. (1987) Experiments in Automatic Phrase Indexing for Document Retrieval: Comparison of Syntactic and Non-Syntactic Methods. Ph.D. thesis, Dept. of Computer Science, Cornell University.
[7] Finch, S. P. & N. Chater (1991) A Hybrid Approach to the Automatic Learning of Linguistic Categories. Artificial Intelligence and Simulated Behaviour Quarterly, 78: 16--24.
[8] Finch, S. (1993) Finding Structure in Language. Ph.D. thesis, Centre for Cognitive Science, University of Edinburgh, Edinburgh.
[9] Fuhr, N. (1989) Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1): 55--72.
[10] Fuhr, N. & C. Buckley (1993) Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models. First TREC Conference.
[11] Hindle, D. (1990) Noun Classification from Predicate-Argument Structures. In Proceedings of the 22nd meeting of the Association of Computational Linguistics, pp. 268--75.
[12] Jacobs, P. & L. Rau (1990) SCISOR: Extracting Information from On-line News. Communications of the ACM, 33(11): 88--97.
[13] Kupiec, J. (1992) Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6(3): 225--42.
[14] Lewis, D. (1991) Evaluating text categorisation. In Speech and Natural Language Workshop, pp. 136--143.
[15] Lewis, D. (1992a) Representation and learning in information retrieval. Ph.D. thesis, Computer Science Dept., Univ. Mass., Amherst, Ma.
[16] Lewis, D. (1992b) An Evaluation of Phrasal and Clustered Representations on a Text Categorization Problem. In Proceedings of SIGIR 92.
[17] Lewis, D. (1992c) Feature selection and feature extraction for text categorization. In Speech and Natural Language: Proceedings of a Workshop held at Harriman, NY, pp. 212--217.
[18] Lewis, D. & K. Sparck-Jones (1993) Natural language processing for information retrieval. University of Cambridge Technical Report 307, Cambridge.
[19] Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, Ca.
[20] van Rijsbergen, C. J. (1979) Information Retrieval. Butterworths, London.
[21] Sacks-Davis, R. (1990) Using Syntactic Analysis in a Document Retrieval System that Uses Signature Files. ACM SIGIR-90.
[22] Salton, G. & M. J. McGill (1983) Introduction to Modern Information Retrieval. McGraw-Hill, NY.
[23] Salton, G. & C. Buckley (1988) Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5): 513--23.
[24] Zadeh, L. (1965) Fuzzy Sets. Information and Control, 8: 338--53.

Cited By

  • (2019) Mining Text Using Keyword Distributions. Journal of Intelligent Information Systems 10:3 (281-300). DOI: 10.1023/A:1008623632443. Online publication date: 1-Jun-2019
  • (1995) Partial orders for document representation. Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval (264-272). DOI: 10.1145/215206.215369. Online publication date: 1-Jul-1995

Information

Published In

ANLC '94: Proceedings of the fourth conference on Applied natural language processing
October 1994
226 pages

Sponsors

  • ACL: Association for Computational Linguistics
  • Gesellschaft für Informatik

Publisher

Association for Computational Linguistics

United States

Qualifiers

  • Article
