Abstract
We have applied the inductive learning of statistical decision trees and relaxation labeling to the Natural Language Processing (NLP) task of morphosyntactic disambiguation (Part-of-Speech Tagging). The learning process is supervised and obtains a language model oriented to resolving POS ambiguities, consisting of a set of statistical decision trees that express the distribution of tags and words in some relevant contexts. The acquired decision trees have been used directly in a tagger that is both relatively simple and fast, and which has been tested and evaluated on the Wall Street Journal (WSJ) corpus with competitive accuracy. However, better results can be obtained by translating the trees into rules to feed a flexible relaxation-labeling-based tagger. In this direction we describe a tagger which is able to use information of any kind (n-grams, automatically acquired constraints, linguistically motivated manually written constraints, etc.), and in particular to incorporate the machine-learned decision trees. Simultaneously, we address the problem of tagging when only limited training material is available, which is crucial in any process of constructing an annotated corpus from scratch. We show that high levels of accuracy can be achieved with our system in this situation, and report some results obtained when using it to develop a 5.5-million-word Spanish corpus from scratch.
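As an illustration of the relaxation labeling idea mentioned above, the following is a minimal sketch and not the tagger described in the article. It assumes only binary constraints between adjacent tags, a hypothetical `compat` table with compatibility values in [-1, 1], and a classical multiplicative update in which each tag weight is scaled by one plus its support and then renormalized. The function name `relax`, the toy sentence, and all numeric values are illustrative assumptions.

```python
def relax(candidates, compat, iterations=20):
    """Toy relaxation labeling over a sentence.

    candidates: one list of possible tags per word.
    compat: dict mapping (left_tag, right_tag) to a compatibility in [-1, 1]
            (hypothetical values; missing pairs count as 0).
    Returns the highest-weight tag per word and the final weight tables.
    """
    # Start from uniform weights over each word's candidate tags.
    weights = [{t: 1.0 / len(tags) for t in tags} for tags in candidates]
    for _ in range(iterations):
        new_weights = []
        for i, current in enumerate(weights):
            support = {}
            for tag in current:
                total_support, neighbours = 0.0, 0
                if i > 0:  # influence of the word on the left
                    total_support += sum(
                        p * compat.get((prev, tag), 0.0)
                        for prev, p in weights[i - 1].items()
                    )
                    neighbours += 1
                if i + 1 < len(weights):  # influence of the word on the right
                    total_support += sum(
                        p * compat.get((tag, nxt), 0.0)
                        for nxt, p in weights[i + 1].items()
                    )
                    neighbours += 1
                # Averaging keeps the support inside [-1, 1].
                support[tag] = total_support / neighbours if neighbours else 0.0
            norm = sum(current[t] * (1.0 + support[t]) for t in current)
            if norm > 0:
                updated = {t: current[t] * (1.0 + support[t]) / norm for t in current}
            else:
                updated = dict(current)  # every tag fully rejected: keep weights
            new_weights.append(updated)
        weights = new_weights  # synchronous update of all words
    return [max(w, key=w.get) for w in weights], weights


# Toy example: the middle word is ambiguous between noun and verb.
candidates = [["DT"], ["NN", "VB"], ["VBZ"]]
compat = {
    ("DT", "NN"): 0.8, ("DT", "VB"): -0.8,
    ("NN", "VBZ"): 0.5, ("VB", "VBZ"): -0.5,
}
tags, weights = relax(candidates, compat)
print(tags)
```

In this sketch the weights of each word always sum to one, and the tag whose context is most compatible gains weight at every iteration. Richer constraints, such as rules derived from decision tree branches, would only change how the support value is computed.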
Màrquez, L., Padró, L. & Rodríguez, H. A Machine Learning Approach to POS Tagging. Machine Learning 39, 59–91 (2000). https://doi.org/10.1023/A:1007673816718
DOI: https://doi.org/10.1023/A:1007673816718