Abstract
In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words.
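To make the idea concrete, one natural form of such an estimate (a sketch consistent with this description; the neighbor set S(w_1) and the similarity weight W are placeholders rather than the specific choices studied in the paper) averages the conditional distributions of the words most similar to the conditioning word:

\[
P_{\mathrm{SIM}}(w_2 \mid w_1) \;=\; \sum_{w_1' \in S(w_1)} \frac{W(w_1, w_1')}{\sum_{w_1'' \in S(w_1)} W(w_1, w_1'')} \, P(w_2 \mid w_1'),
\]

so that a bigram never observed in training can still be scored through the observed behavior of words similar to w_1.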
We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
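As a schematic illustration of how the similarity-based estimate enters a back-off model (the discounted estimate P_d, the normalization factor alpha, and the redistribution model P_r follow the standard Katz back-off pattern and are sketched here, not quoted from the paper):

\[
\hat{P}(w_2 \mid w_1) =
\begin{cases}
P_d(w_2 \mid w_1) & \text{if } c(w_1, w_2) > 0, \\
\alpha(w_1)\, P_r(w_2 \mid w_1) & \text{otherwise},
\end{cases}
\]

where standard back-off takes P_r(w_2 | w_1) to be the unigram probability P(w_2), while the similarity-based variant derives P_r from P_SIM above, so that probability mass for unseen bigrams is redistributed according to the company a word keeps rather than by unigram frequency alone.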
We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.
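For concreteness, the decision rule in pseudo-word disambiguation can be sketched as follows (a minimal illustration rather than the experimental setup; the function disambiguate and the toy counts are invented for the example). A pseudo-word conflates two real words, and a method is credited when it assigns the higher conditional probability to the word that actually occurred in the held-out text.

def disambiguate(w1, candidates, prob):
    """Return the candidate w2 that maximizes the estimated P(w2 | w1).

    prob(w1, w2) can be any conditional bigram estimator: maximum
    likelihood, back-off, or a similarity-based model.
    """
    return max(candidates, key=lambda w2: prob(w1, w2))


if __name__ == "__main__":
    # Toy counts standing in for a training corpus (hypothetical numbers).
    toy_counts = {("eat", "peach"): 3, ("eat", "beach"): 0, ("eat",): 10}

    def mle(w1, w2):
        # Plain maximum-likelihood estimate c(w1, w2) / c(w1).
        return toy_counts.get((w1, w2), 0) / toy_counts[(w1,)]

    # Pseudo-word {"peach", "beach"}: the estimator should prefer "peach"
    # as the completion of "eat ...".
    print(disambiguate("eat", ["peach", "beach"], mle))  # prints: peach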
Cite this article
Dagan, I., Lee, L. & Pereira, F.C.N. Similarity-Based Models of Word Cooccurrence Probabilities. Machine Learning 34, 43–69 (1999). https://doi.org/10.1023/A:1007537716579