Exploiting sophisticated representations for document retrieval

Author:

Steven Finch

ANLC '94: Proceedings of the fourth conference on Applied natural language processing

Pages 65 - 71

https://doi.org/10.3115/974358.974374

Published: 13 October 1994

Abstract

The use of NLP techniques for document classification has not produced significant improvements in performance within the standard term-weighting statistical assignment paradigm (Fagan, 1987; Lewis, 1992b, 1992c; Buckley, 1993). This perplexing fact needs both an explanation and a solution if the power of recently developed NLP techniques is to be successfully applied in IR. We present a novel method for adding linguistic annotation to corpora, which uses a statistical POS tagger in conjunction with unsupervised structure-finding methods to derive notions of "noun group", "verb group", and so on. The method is inherently extensible to more sophisticated annotation and does not require a pre-tagged corpus for fitting. One distinguishing feature of a linguistically sophisticated representation of documents, compared with a word-set representation, is that its units are more frequently good individual predictors of document descriptors (keywords) than single words are. This leads us to consider assigning descriptors from individual phrases rather than from the weighted sum of a word-set representation. We investigate how sets of individually high-precision rules can yield low precision when used together, and develop some theory about these probably-correct rules. We then replicate results showing that standard statistical models are not particularly suitable for exploiting linguistically sophisticated representations, and show that a statistically fitted rule-based model provides significantly improved performance for sophisticated representations. The paper therefore shows that statistical systems can exploit sophisticated representations of documents, and lends some support to the use of more linguistically sophisticated representations for document classification. This paper reports on work done for the LRE project SISTA, which is creating a PC-based tool for use in the technical abstracting industry.
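
The claim that individually high-precision rules can combine into a low-precision set can be illustrated with a toy example. The sketch below is not the paper's model: the document ids, the three phrase rules, and the helper `precision` are all hypothetical. It shows only one possible mechanism, in which rules share their correct hits but err on different documents, so that pooling their firings pools their errors.

```python
def precision(fired, relevant):
    """Fraction of the documents a rule fired on that truly carry the descriptor."""
    fired = set(fired)
    if not fired:
        return 0.0
    return len(fired & set(relevant)) / len(fired)


# Hypothetical toy data: ids of documents that truly carry one descriptor.
relevant = {1, 2, 3}

# Hypothetical rules: each phrase assigns the descriptor to the documents it occurs in.
rules = {
    "phrase_a": {1, 2, 3, 4},
    "phrase_b": {1, 2, 3, 5},
    "phrase_c": {1, 2, 3, 6},
}

# Precision of each rule taken on its own.
individual = {name: precision(docs, relevant) for name, docs in rules.items()}

# Precision when the descriptor is assigned if any rule fires.
combined_fired = set().union(*rules.values())
combined = precision(combined_fired, relevant)

print(individual)  # each rule alone: 0.75
print(combined)    # all rules together: 0.5
```

Each rule alone is right on three of the four documents it fires on, yet the union of the rules is right on only three of six, because the correct assignments overlap while the mistakes do not.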

References

[1]

Adams, E. (1975) The Logic of Conditionals: An Application of Probability to Deductive Logic. Reidel.

[2]

Apte, C., F. Damerau & S. Weiss (1994) Towards Language Independent Automated Learning of Text Categorization Methods. In Proceedings of the Seventeenth ACM-SIGIR Conference on Information Retrieval, pp. 23--30. DCU, Dublin.

[3]

Buckley, C. (1993) The Importance of Proper Weighting Methods. ARPA Workshop on Human Language Technology.

[4]

Church, K. (1988) A stochastic parts program and noun phrase parser for unrestricted text. In Second conference on applied NLP, pp. 136--43.

[5]

Church, K., W. Gale, P. Hanks & D. Hindle (1989) Parsing, Word Associations and Typical Predicate-Argument Relations. In International Parsing Technologies Workshop. CMU, Pittsburgh.

[6]

Fagan, J. (1987) Experiments in Automatic Phrase Indexing for Document Retrieval: Comparison of Syntactic and Non-Syntactic Methods. PhD Thesis. Cornell University, Dept. of Computer Science.

[7]

Finch, S. P. & N. Chater (1991) A Hybrid Approach to the Automatic Learning of Linguistic Categories. Artificial Intelligence and Simulated Behaviour Quarterly, 78: 16--24.

[8]

Finch, S. (1993) Finding Structure in Language. Ph.D. thesis, Centre for Cognitive Science, University of Edinburgh, Edinburgh.

[9]

Fuhr, N. (1989) Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1): 55--72.

[10]

Fuhr, N. & Buckley, C. (1993) Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models. First TREC Conference.

[11]

Hindle, D. (1990) Noun Classification from Predicate-Argument Structures. In Proceedings of the 28th meeting of the Association for Computational Linguistics, pp. 268--75.

[12]

Jacobs, P. & Rau, L. (1990) SCISOR: Extracting Information from On-line News. Communications of the ACM, 33(11): 88--97.

[13]

Kupiec, J. (1992) Robust part-of-speech tagging using a hidden Markov model. Computer Speech and Language, 6(3): 225--42.

[14]

Lewis, D. (1991) Evaluating text categorisation. In Speech and Natural Language Workshop, pp. 136--143.

[15]

Lewis, D. (1992a) Representation and learning in information retrieval. Ph.D. thesis, Computer Science Dept., Univ. Mass., Amherst, Ma.

[16]

Lewis, D. (1992b) An Evaluation of Phrasal and Clustered Representations on a Text Categorization Problem. Proceedings of SIGIR 92.

[17]

Lewis, D. (1992c) Feature selection and feature extraction for text categorization. In Speech and Natural Language: Proceedings of a Workshop held at Harriman, NY, pp. 212--217.

[18]

Lewis, D. & K. Sparck-Jones (1993) Natural language processing for information retrieval. University of Cambridge Technical Report 307, Cambridge.

[19]

Pearl, J. (1988) Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, Ca.

[20]

van Rijsbergen, C. J. (1979) Information retrieval. Butterworths, London.

[21]

Sacks-Davis, R. (1990) Using Syntactic Analysis in a Document Retrieval System that Uses Signature Files. ACM SIGIR-90.

[22]

Salton, G. & McGill, M. J. (1983) Introduction to modern information retrieval. McGraw-Hill, NY.

[23]

Salton, G. & C. Buckley (1988) Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5): 513--23.

[24]

Zadeh, L. (1965) Fuzzy Sets. Information and Control, 8: 338--53.

Cited By

Feldman R, Dagan I, Hirsh H (2019) Mining Text Using Keyword Distributions. Journal of Intelligent Information Systems, 10:3 (281-300). DOI: 10.1023/A:1008623632443. Online publication date: 1-Jun-2019
https://dl.acm.org/doi/10.1023/A%3A1008623632443
Finch S (1995) Partial orders for document representation. Proceedings of the 18th annual international ACM SIGIR conference on Research and development in information retrieval (264-272). DOI: 10.1145/215206.215369. Online publication date: 1-Jul-1995
https://dl.acm.org/doi/10.1145/215206.215369

Exploiting sophisticated representations for document retrieval
1. Computing methodologies
  1. Artificial intelligence
2. Hardware
  1. Power and energy
    1. Power estimation and optimization

Published In

ANLC '94: Proceedings of the fourth conference on Applied natural language processing

October 1994

226 pages

Program Chair:
Paul Jacobs
Philadelphia, Pennsylvania

Sponsors

ACL: Association for Computational Linguistics
Gesellschaft für Informatik

Publisher

Association for Computational Linguistics

United States

Qualifiers

Article

Bibliometrics

Total Citations: 2
Total Downloads: 295
Downloads (Last 12 months): 46
Downloads (Last 6 weeks): 4

Reflects downloads up to 13 Dec 2024
