Abstract
Within the knowledge discovery in databases (KDD) process, the phases that precede data mining consume most of the time spent analysing data. Few research efforts have addressed these steps compared with data mining, suggesting that new approaches and tools are needed to support data preparation. In this regard, we present a new methodology for ontology-based KDD that adopts a federated approach to database integration and retrieval. Within this model, we have developed an ontology-based system called OntoDataClean, which deals with instance-level integration and data preprocessing. As part of its development, a preprocessing ontology was built to store the information about the required transformations. Various biomedical experiments were carried out, showing that data were correctly transformed using the preprocessing ontology. Although OntoDataClean does not cover every possible data transformation, the results suggest that ontologies are a suitable mechanism for improving quality in the various steps of KDD processes.
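The idea of keeping the required transformations as stored descriptions, separate from the code that executes them, can be illustrated with a minimal Python sketch. It is not the OntoDataClean implementation: the `Transformation` class, the three transformation kinds, the field names, and the sample records are all hypothetical, and a plain Python list stands in for the preprocessing ontology.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

Record = Dict[str, Any]


@dataclass(frozen=True)
class Transformation:
    """One declarative preprocessing step (hypothetical stand-in for an ontology instance)."""
    field: str
    kind: str                # "scale", "map_values" or "fill_missing"
    params: Dict[str, Any]


def _scale(value: Any, params: Dict[str, Any]) -> Any:
    # Unit conversion; missing values are left untouched.
    return None if value is None else value * params["factor"]


def _map_values(value: Any, params: Dict[str, Any]) -> Any:
    # Replace source-specific codes with a shared vocabulary.
    return params["mapping"].get(value, value)


def _fill_missing(value: Any, params: Dict[str, Any]) -> Any:
    # Substitute a default when the value is absent.
    return params["default"] if value is None else value


HANDLERS = {
    "scale": _scale,
    "map_values": _map_values,
    "fill_missing": _fill_missing,
}


def apply_transformations(records: List[Record],
                          transformations: List[Transformation]) -> List[Record]:
    """Return cleaned copies of the records; the input is not modified."""
    cleaned = []
    for record in records:
        row = dict(record)
        for t in transformations:
            row[t.field] = HANDLERS[t.kind](row.get(t.field), t.params)
        cleaned.append(row)
    return cleaned


# Assumed transformation descriptions for one hypothetical source.
PREPROCESSING = [
    Transformation("weight", "scale", {"factor": 1000}),  # kilograms to grams
    Transformation("sex", "map_values", {"mapping": {"M": "male", "F": "female"}}),
    Transformation("smoker", "fill_missing", {"default": "unknown"}),
]

source_a = [
    {"weight": 72, "sex": "M", "smoker": None},
    {"weight": 65, "sex": "female"},
]
cleaned = apply_transformations(source_a, PREPROCESSING)
```

Because the transformations are data rather than code, a different source can be handled by supplying a different list of descriptions without changing `apply_transformations`.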
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Perez-Rey, D., Anguita, A., Crespo, J. (2006). OntoDataClean: Ontology-Based Integration and Preprocessing of Distributed Data. In: Maglaveras, N., Chouvarda, I., Koutkias, V., Brause, R. (eds) Biological and Medical Data Analysis. ISBMDA 2006. Lecture Notes in Computer Science, vol 4345. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11946465_24
DOI: https://doi.org/10.1007/11946465_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68063-5
Online ISBN: 978-3-540-68065-9
eBook Packages: Computer Science (R0)