Abstract
In many applications of natural language processing (NLP) it is necessary to determine the likelihood of a given word combination. For example, a speech recognizer may need to determine which of the two word combinations “eat a peach” and “eat a beach” is more likely. Statistical NLP methods determine the likelihood of a word combination from its frequency in a training corpus. However, the nature of language is such that many word combinations are infrequent and do not occur in any given corpus. In this work we propose a method for estimating the probability of such previously unseen word combinations using available information on “most similar” words.
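To make the idea concrete, one natural form of such an estimate (a sketch consistent with this description; the neighbor set S(w_1) and the similarity weight W are placeholders rather than the specific choices studied in the paper) averages the conditional distributions of the words most similar to the conditioning word:

\[
P_{\mathrm{SIM}}(w_2 \mid w_1) \;=\; \sum_{w_1' \in S(w_1)} \frac{W(w_1, w_1')}{\sum_{w_1'' \in S(w_1)} W(w_1, w_1'')} \, P(w_2 \mid w_1'),
\]

so that a bigram never observed in training can still be scored through the observed behavior of words similar to w_1.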
We describe probabilistic word association models based on distributional word similarity, and apply them to two tasks, language modeling and pseudo-word disambiguation. In the language modeling task, a similarity-based model is used to improve probability estimates for unseen bigrams in a back-off language model. The similarity-based method yields a 20% perplexity improvement in the prediction of unseen bigrams and statistically significant reductions in speech-recognition error.
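As a schematic illustration of how the similarity-based estimate enters a back-off model (the discounted estimate P_d, the normalization factor alpha, and the redistribution model P_r follow the standard Katz back-off pattern and are sketched here, not quoted from the paper):

\[
\hat{P}(w_2 \mid w_1) =
\begin{cases}
P_d(w_2 \mid w_1) & \text{if } c(w_1, w_2) > 0, \\
\alpha(w_1)\, P_r(w_2 \mid w_1) & \text{otherwise},
\end{cases}
\]

where standard back-off takes P_r(w_2 | w_1) to be the unigram probability P(w_2), while the similarity-based variant derives P_r from P_SIM above, so that probability mass for unseen bigrams is redistributed according to the company a word keeps rather than by unigram frequency alone.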
We also compare four similarity-based estimation methods against back-off and maximum-likelihood estimation methods on a pseudo-word sense disambiguation task in which we controlled for both unigram and bigram frequency to avoid giving too much weight to easy-to-disambiguate high-frequency configurations. The similarity-based methods perform up to 40% better on this particular task.
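For concreteness, the decision rule in pseudo-word disambiguation can be sketched as follows (a minimal illustration rather than the experimental setup; the function disambiguate and the toy counts are invented for the example). A pseudo-word conflates two real words, and a method is credited when it assigns the higher conditional probability to the word that actually occurred in the held-out text.

def disambiguate(w1, candidates, prob):
    """Return the candidate w2 that maximizes the estimated P(w2 | w1).

    prob(w1, w2) can be any conditional bigram estimator: maximum
    likelihood, back-off, or a similarity-based model.
    """
    return max(candidates, key=lambda w2: prob(w1, w2))


if __name__ == "__main__":
    # Toy counts standing in for a training corpus (hypothetical numbers).
    toy_counts = {("eat", "peach"): 3, ("eat", "beach"): 0, ("eat",): 10}

    def mle(w1, w2):
        # Plain maximum-likelihood estimate c(w1, w2) / c(w1).
        return toy_counts.get((w1, w2), 0) / toy_counts[(w1,)]

    # Pseudo-word {"peach", "beach"}: the estimator should prefer "peach"
    # as the completion of "eat ...".
    print(disambiguate("eat", ["peach", "beach"], mle))  # prints: peach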
Cite this article
Dagan, I., Lee, L. & Pereira, F.C.N. Similarity-Based Models of Word Cooccurrence Probabilities. Machine Learning 34, 43–69 (1999). https://doi.org/10.1023/A:1007537716579