Abstract
Lexical collocations have particular statistical distributions. We have developed a set of statistical techniques for retrieving and identifying collocations from large textual corpora. These techniques can identify collocations of arbitrary length as well as flexible collocations. They have been implemented in a lexicographic tool, Xtract, which automatically acquires collocations with high retrieval performance. Xtract works in three stages. The first stage uses a statistical technique to identify word pairs involved in a syntactic relation; the words can appear in the text in any order and can be separated by an arbitrary number of other words. The second stage extracts n-word collocations (or n-grams) in a much simpler way than related methods; these collocations can involve closed-class words such as particles and prepositions. The third stage takes the output of stage one and parses the sentences involving a given word pair in order to identify the proper syntactic relation between the two words. As a side effect, the third stage filters out a number of candidate collocations as irrelevant and thus produces higher-quality output. In this paper we present an overview of Xtract and describe several uses for it and for the knowledge it retrieves, such as language generation and machine translation.
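As a rough illustration of the kind of statistic a first stage like this could rely on, the following Python sketch counts how often two words co-occur within a fixed window, in either order, and keeps the collocates of a word whose frequency stands out from the rest. It is not Xtract's implementation: the function names `pair_offsets` and `strong_collocates`, the 5-token window, and the z-score threshold are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev


def pair_offsets(sentences, window=5):
    """For each ordered pair (w, c), count how often c occurs at each signed
    offset from w, within +/- `window` tokens of the same sentence.
    The window size is an illustrative assumption."""
    counts = defaultdict(Counter)  # (w, c) -> Counter{offset: frequency}
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, tokens[j])][j - i] += 1
    return counts


def strong_collocates(counts, word, min_z=1.0):
    """Return {collocate: z-score} for collocates of `word` whose total
    co-occurrence frequency lies at least `min_z` standard deviations above
    the mean frequency over all collocates of `word`.
    The threshold is an illustrative assumption."""
    freqs = {c: sum(offs.values())
             for (w, c), offs in counts.items() if w == word}
    if len(freqs) < 2:
        return {}
    mu, sigma = mean(freqs.values()), pstdev(freqs.values())
    if sigma == 0:
        return {}
    scores = {c: (f - mu) / sigma for c, f in freqs.items()}
    return {c: z for c, z in scores.items() if z >= min_z}


if __name__ == "__main__":
    sentences = [
        "the stock market fell sharply".split(),
        "investors left the stock market".split(),
        "the market for stock options grew".split(),
    ]
    counts = pair_offsets(sentences)
    print(dict(counts[("stock", "market")]))
    print(strong_collocates(counts, "stock"))
```

Keeping the signed offsets, rather than a single count per pair, records whether a pair tends to occur at a fixed distance or in varying positions, which is the information needed to tell rigid collocations from flexible ones.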
Author information
Frank Smadja is in the Department of Computer Science at Columbia University and has been working on lexical collocations for his doctoral thesis.
Cite this article
Smadja, F. Xtract: An overview. Comput Hum 26, 399–413 (1992). https://doi.org/10.1007/BF00136983