Abstract
Semantically heterogeneous and distributed data sources are common in application domains such as bioinformatics and security informatics. In such a setting, each data source has an associated ontology. Different users or applications need to be able to query such data sources for statistics of interest (e.g., statistics needed to learn a predictive model from data). Because no single ontology meets the needs of all applications or users in every context, or even of a single user in different contexts, there is a need for principled approaches to acquiring statistics from semantically heterogeneous data. In this paper, we introduce ontology-extended data sources and define a user perspective consisting of an ontology and a set of interoperation constraints between the data source ontologies and the user ontology. We show how these constraints can be used to derive mappings from the source ontologies to the user ontology. We observe that most learning algorithms use only certain statistics computed from the data in the process of generating the hypothesis that they output. We show how the ontology mappings can be used to answer the statistical queries needed by algorithms for learning classifiers from data viewed from a given user perspective.
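The idea of answering statistical queries through ontology mappings can be illustrated with a minimal sketch. It is not the method developed in the paper: the source names, attribute values, mappings, and helper functions below are all hypothetical, and the mappings are written by hand rather than derived from interoperation constraints. Each source answers a count query in the user's vocabulary, and only those counts are combined.

```python
from collections import Counter

# Hypothetical hand-written mappings from each source's attribute values
# to the user's vocabulary (the paper derives mappings from constraints).
MAPPINGS = {
    "source_a": {"rainy": "wet", "drizzle": "wet", "sunny": "dry"},
    "source_b": {"precipitation": "wet", "clear": "dry"},
}

# Hypothetical records held at each source: (attribute value, class label).
SOURCES = {
    "source_a": [("rainy", "stay"), ("sunny", "go"), ("drizzle", "stay")],
    "source_b": [("clear", "go"), ("precipitation", "go")],
}


def local_counts(records, mapping):
    """Answer a count query at one source, expressed in user-ontology terms."""
    counts = Counter()
    for value, label in records:
        counts[(mapping[value], label)] += 1
    return counts


def joint_counts(sources, mappings):
    """Sum the per-source counts; raw records never leave their source."""
    total = Counter()
    for name, records in sources.items():
        total += local_counts(records, mappings[name])
    return total


def cond_prob(counts, value, label):
    """Estimate P(value | label) from the combined counts."""
    label_total = sum(n for (_, lab), n in counts.items() if lab == label)
    return counts[(value, label)] / label_total


counts = joint_counts(SOURCES, MAPPINGS)
print(counts[("wet", "stay")], counts[("dry", "go")], counts[("wet", "go")])
print(cond_prob(counts, "wet", "go"))
```

Counts of this kind are the statistics a naive Bayes learner needs, so in this sketch a classifier could be fit from the combined counts without pooling the underlying records.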
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Caragea, D., Pathak, J., Honavar, V.G. (2004). Learning Classifiers from Semantically Heterogeneous Data. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3291. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30469-2_9
DOI: https://doi.org/10.1007/978-3-540-30469-2_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23662-7
Online ISBN: 978-3-540-30469-2
eBook Packages: Springer Book Archive