WordificationMI: multi-relational data mining through multiple-instance propositionalization

Luis A. Quintero-Domínguez¹,², Carlos Morell² & Sebastián Ventura³ (ORCID: orcid.org/0000-0003-4216-6378)

Abstract

Multi-relational data mining (MRDM) looks for patterns in a relational database. One of the established approaches to MRDM is propositionalization, which transforms a relational database into a simpler representation, commonly a single table. Another approach that has proven effective for learning problems involving one-to-many relationships in the data is multiple-instance learning. In this paper, we propose a new technique for transforming relational data, called WordificationMI, which takes advantage of the strengths of multiple-instance learning. The proposal builds on the bag-of-words representation introduced in the Wordification methodology, with the difference that it transforms a relational database into a multiple-instance representation. Additionally, we propose a feature selection method, named MICHI (\(\chi^{2}_{\mathrm{MI}}\)), for reducing the dimensionality of the datasets obtained with WordificationMI. We also present an empirical evaluation with ten relational databases and four learning techniques that shows the effectiveness of the proposed methods.
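
The idea of turning a one-to-many relationship into a multiple-instance representation can be illustrated with a minimal sketch. The abstract does not specify the actual WordificationMI algorithm, so everything below is an assumption made for illustration: each example in the main table becomes a bag, each related row becomes one instance, and each attribute value is encoded as a word of the form table_attribute_value. The function names, the toy tables, and the column names are hypothetical.

```python
from collections import Counter


def row_to_words(table, row, skip=()):
    """Encode one row as 'table_attribute_value' words (bag-of-words style)."""
    return [f"{table}_{attr}_{value}" for attr, value in row.items() if attr not in skip]


def build_bags(main_rows, related_rows, key, related_table):
    """Group the related rows of each main-table example into a bag of instances.

    Assumption: one related row -> one instance, stored as word counts.
    """
    bags = {row[key]: [] for row in main_rows}
    for row in related_rows:
        words = row_to_words(related_table, row, skip=(key,))
        bags[row[key]].append(Counter(words))
    return bags


# Hypothetical toy database: molecules (main table) and their atoms (one-to-many).
molecules = [
    {"mol_id": "m1", "label": "pos"},
    {"mol_id": "m2", "label": "neg"},
]
atoms = [
    {"mol_id": "m1", "element": "c", "charge": "neg"},
    {"mol_id": "m1", "element": "o", "charge": "pos"},
    {"mol_id": "m2", "element": "c", "charge": "pos"},
]

bags = build_bags(molecules, atoms, key="mol_id", related_table="atom")

for mol_id, instances in bags.items():
    print(mol_id, [dict(instance) for instance in instances])
```

In this sketch the class label stays attached to the bag (the molecule), not to the individual instances (the atoms), which is the usual multiple-instance setting. A single-table propositionalization would instead merge all words of a molecule into one feature vector.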

Notes

All databases used here were obtained from https://relational.fit.cvut.cz, except IMDb, which was provided by the authors of Wordification.

Acknowledgements

We gratefully acknowledge Matic Perovšek for his clarifications on the Wordification method and for providing the IMDb database.

Author information

Authors and Affiliations

Departamento de Ingeniería Informática, Universidad de Sancti Spíritus “José Martí Pérez”, Comandante Fajardo S/N, Sancti Spíritus, Cuba
Luis A. Quintero-Domínguez
Computer Science Department, Universidad Central “Marta Abreu” de Las Villas, Santa Clara, Cuba
Luis A. Quintero-Domínguez & Carlos Morell
Department of Computer Science and Numerical Analysis, University of Cordoba, Córdoba, Spain
Sebastián Ventura

Corresponding author

Correspondence to Sebastián Ventura.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This research was supported by the Spanish Ministry of Economy and the European Regional Development Fund, Project TIN2017-83445-P. The authors also thank the AUIP and the Council of Economy and Knowledge of the Andalusia Board, as sponsors of the Academic Mobility Scholarship Program of the AUIP.

About this article

Cite this article

Quintero-Domínguez, L.A., Morell, C. & Ventura, S. WordificationMI: multi-relational data mining through multiple-instance propositionalization. Prog Artif Intell 8, 375–387 (2019). https://doi.org/10.1007/s13748-019-00186-y

Received: 21 February 2019
Accepted: 25 April 2019
Published: 13 May 2019
Issue Date: 01 September 2019
DOI: https://doi.org/10.1007/s13748-019-00186-y
