Abstract
Multi-relational data mining (MRDM) looks for patterns in a relational database. One of the established approaches to MRDM is propositionalization, which transforms a relational database into a simpler representation, commonly a single table. Another approach that has proven effective for learning problems involving one-to-many relationships in the data is multiple-instance learning. In this paper, we propose a new technique for transforming relational data, called WordificationMI, which takes advantage of the strengths of multiple-instance learning. The proposal is based on the bag-of-words representation introduced in the Wordification methodology, but it differs in transforming a relational database into a multiple-instance representation. Additionally, we propose a feature selection method, named MICHI (\(\chi^{2}_{\mathrm{MI}}\)), for reducing the dimensionality of the datasets obtained with WordificationMI. We also present an empirical evaluation on ten relational databases with four learning techniques, which shows the effectiveness of the proposed methods.
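To make the idea of a multiple-instance bag-of-words representation concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes that each row of a secondary table related to a main-table example becomes one instance, and that each instance is a count of words of the form `table_attribute_value`. The function names (`row_to_words`, `build_bags`) and the toy molecule/atom data are hypothetical and serve only as illustration.

```python
from collections import Counter


def row_to_words(table, row):
    """Turn one row (a dict) into 'table_attribute_value' word tokens."""
    return [f"{table}_{attr}_{value}" for attr, value in row.items()]


def build_bags(main_rows, secondary_rows, key, table):
    """Build one bag per main-table row.

    Each related secondary-table row becomes one instance, stored as a
    Counter of its words (assumption: one instance per related row).
    """
    bags = {row[key]: [] for row in main_rows}
    for row in secondary_rows:
        fk = row[key]
        if fk in bags:
            # Drop the foreign key itself so it does not become a word.
            features = {a: v for a, v in row.items() if a != key}
            bags[fk].append(Counter(row_to_words(table, features)))
    return bags


# Hypothetical toy data: molecules (main table) and their atoms.
molecules = [
    {"mol_id": "m1", "label": "active"},
    {"mol_id": "m2", "label": "inactive"},
]
atoms = [
    {"mol_id": "m1", "element": "c", "charge": "neg"},
    {"mol_id": "m1", "element": "o", "charge": "neg"},
    {"mol_id": "m2", "element": "c", "charge": "pos"},
]

bags = build_bags(molecules, atoms, key="mol_id", table="atom")
for mol_id, instances in bags.items():
    print(mol_id, [dict(inst) for inst in instances])
```

In this sketch a bag keeps its instances separate instead of merging them into one word-count vector, which is the property that lets a multiple-instance learner work with the one-to-many relationship directly.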
Notes
All databases used here were obtained from https://relational.fit.cvut.cz, except IMDb, which was provided by the authors of Wordification.
Acknowledgements
We gratefully acknowledge Matic Perovšek for his clarifications on the Wordification method and for providing the IMDb database.
This research was supported by the Spanish Ministry of Economy and the European Regional Development Fund, Project TIN2017-83445-P. The authors also thank the AUIP and the Council of Economy and Knowledge of the Andalusia Board, as sponsors of the Academic Mobility Scholarship Program of the AUIP.
Cite this article
Quintero-Domínguez, L.A., Morell, C. & Ventura, S. WordificationMI: multi-relational data mining through multiple-instance propositionalization. Prog Artif Intell 8, 375–387 (2019). https://doi.org/10.1007/s13748-019-00186-y
DOI: https://doi.org/10.1007/s13748-019-00186-y