Abstract
Bilingual documentation has become common in many official institutions and private companies. In this scenario, the categorization of bilingual text is a useful tool that can also be applied in the machine translation field. To tackle this classification task, different approaches are proposed. On the one hand, two finite-state transducer algorithms from the grammatical inference domain are discussed. On the other hand, the well-known naive Bayes approximation is presented along with a possible model based on n-gram language models. Experiments carried out on a bilingual corpus demonstrate the adequacy of these methods and the relevance of a second information source in text classification, as supported by classification error rates. A relative reduction of 29% with respect to the best previous results on the monolingual version of the same task has been obtained.
Work supported by the “Agència Valenciana de Ciència i Tecnologia” under grant GRUPOS03/031 and the Spanish project TIC2003-08681-C02-02.
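As a rough illustration of the naive Bayes idea mentioned in the abstract, the sketch below classifies a bilingual sentence pair by adding the class prior to the log-likelihoods of one language model per language. It is only a sketch under stated assumptions, not the authors' implementation: it uses unigram models (n = 1) with add-one smoothing, it treats the two languages as independent given the class, and the class name `BilingualNaiveBayes`, the labels, and the toy sentence pairs are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class BilingualNaiveBayes:
    """Toy bilingual naive Bayes classifier (assumed design, not the paper's).

    Each class has one add-one-smoothed unigram model per language.
    """

    def __init__(self):
        self.class_counts = Counter()
        # Word counts per class, one table per language (0 = source, 1 = target).
        self.word_counts = [defaultdict(Counter), defaultdict(Counter)]
        self.vocab = [set(), set()]

    def fit(self, pairs, labels):
        for (src, tgt), label in zip(pairs, labels):
            self.class_counts[label] += 1
            for lang, text in enumerate((src, tgt)):
                tokens = text.lower().split()
                self.word_counts[lang][label].update(tokens)
                self.vocab[lang].update(tokens)
        return self

    def _log_likelihood(self, lang, label, text):
        counts = self.word_counts[lang][label]
        total = sum(counts.values())
        size = len(self.vocab[lang]) + 1  # +1 reserves mass for unseen words
        return sum(
            math.log((counts[word] + 1) / (total + size))
            for word in text.lower().split()
        )

    def predict(self, src, tgt):
        n_samples = sum(self.class_counts.values())

        def score(label):
            # log P(c) + log P(src | c) + log P(tgt | c)
            return (
                math.log(self.class_counts[label] / n_samples)
                + self._log_likelihood(0, label, src)
                + self._log_likelihood(1, label, tgt)
            )

        return max(self.class_counts, key=score)


# Hypothetical toy training data: (Spanish, English) pairs with made-up labels.
pairs = [
    ("una habitación doble", "a double room"),
    ("la cuenta por favor", "the bill please"),
    ("una habitación individual", "a single room"),
    ("quiero pagar la cuenta", "i want to pay the bill"),
]
labels = ["booking", "payment", "booking", "payment"]

clf = BilingualNaiveBayes().fit(pairs, labels)
print(clf.predict("una habitación tranquila", "a quiet room"))
print(clf.predict("la cuenta", "the bill"))
```

Dropping the target-language term from `score` gives a monolingual classifier, which makes it easy to compare the two settings on the same data. Higher-order n-gram models or a different smoothing scheme could replace the unigram model without changing the decision rule.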
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Civera, J., Cubel, E., Juan, A., Vidal, E. (2005). Different Approaches to Bilingual Text Classification Based on Grammatical Inference Techniques. In: Marques, J.S., Pérez de la Blanca, N., Pina, P. (eds) Pattern Recognition and Image Analysis. IbPRIA 2005. Lecture Notes in Computer Science, vol 3523. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11492542_77
DOI: https://doi.org/10.1007/11492542_77
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26154-4
Online ISBN: 978-3-540-32238-2
eBook Packages: Computer Science, Computer Science (R0)